Summary
Building an API for real-time data is a challenging project. Making it robust, scalable, and fast is a full-time job. The team at Tinybird wants to make it easy to turn a continuous stream of data into a production-ready API or data product. In this episode, CEO Jorge Sancha explains how they have architected their system to handle high data throughput and fast response times, and why they have invested heavily in ClickHouse as the core of their platform. This is a great conversation about the challenges of building a maintainable business from a technical and product perspective.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Ascend.io — recognized as a 2021 Gartner Cool Vendor in Enterprise AI Operationalization and Engineering—empowers data teams to build, scale, and operate declarative data pipelines with 95% less code and zero maintenance. Connect to any data source using Ascend’s new flex code data connectors, rapidly iterate on transformations and send data to any destination in a fraction of the time it traditionally takes—just ask companies like Harry’s, HNI, and Mayvenn. Sound exciting? Come join the team! We’re hiring data engineers, so head on over to dataengineeringpodcast.com/ascend and check out our careers page to learn more.
- Your host is Tobias Macey and today I’m interviewing Jorge Sancha about Tinybird, a platform to easily build analytical APIs for real-time data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Tinybird and the story behind it?
- What are some of the types of use cases that your customers are focused on?
- What are the areas of complexity that come up when building analytical APIs that are often overlooked when first designing a system to operate on and expose real-time data?
- What are the supporting systems that are necessary and useful for operating this kind of system which contribute to the overall time and engineering cost beyond the baseline functionality?
- How is the Tinybird platform architected?
- How have the goals and implementation of Tinybird changed or evolved since you first began building it?
- What was your criteria for selecting the core building block of your platform, and how did that lead to your choice to build on top of ClickHouse?
- What are some of the sharp edges that you have run into while operating ClickHouse?
- What are some of the custom tools or systems that you have built to help deal with them?
- What are some of the performance challenges that an API built with Tinybird might run into?
- What are the considerations that users should be aware of to avoid introducing performance issues?
- How do you handle multi-tenancy in your platform? (e.g. separate clusters, in-database quotas, etc.)
- For users of Tinybird, can you talk through the workflow of getting it integrated into their platform and designing an API from their data?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Tinybird used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building and growing Tinybird?
- When is Tinybird the wrong choice?
- What do you have planned for the future of the product and business?
Contact Info
- @jorgesancha on Twitter
- jorgesancha on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Ascend.io, recognized as a 2021 Gartner Cool Vendor in Enterprise AI Operationalization and Engineering, empowers data teams to build, scale, and operate declarative data pipelines with 95% less code and zero maintenance. Connect to any data source using Ascend's new flex code data connectors. Rapidly iterate on transformations and send data to any destination in a fraction of the time it traditionally takes. Just ask companies like Harry's, HNI, and Mayvenn. Sound exciting? You can join the team. They're hiring data engineers, so head on over to dataengineeringpodcast.com/ascend and check out their careers page to learn more. Your host is Tobias Macey. And today, I'm interviewing Jorge Sancha about Tinybird, a platform to easily build analytical APIs for real time data. So, Jorge, can you start by introducing yourself? Hello, everyone, and thanks for having me, Tobias. My name is Jorge Sancha. I am the CEO of Tinybird. My background is in product and engineering, and I've been working in
[00:01:57] Unknown:
startups and data intensive products for the better part of the last 20 years. And super excited to tell you a little bit more about Tinybird today.
[00:02:07] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:10] Unknown:
Yes. Although my background has always been around web applications and so on, data has always been a factor, along with scalability and making sure that applications would be able to serve all of our customers in a fast way and so on. The analytical side of it, and the sort of data as a product, let's say, really was at Carto. Carto is a company that started out in Spain, and now it's all over the world. They do location intelligence, and that was the first company where I got involved with data, let's say, as key elements and analytics. And we were building a platform where customers were bringing their own data, and we were building the tools to do things with it. That was really where I understood, you know, how complex it can be to scale applications and infrastructure up to, you know, thousands of users or thousands of customers, potentially hundreds of thousands of users, and real time analytics. And that was a huge learning experience for me. I joined Carto as a VP of engineering to help them expand the teams, to help them with the delivery, and to help them with processes and all of those great things. And I learned a lot in the process, and I found the cofounders with whom I then started Tinybird. So I would say that, although I've always been involved with data, that was where I understood what analytics was all about.
[00:03:41] Unknown:
And in terms of what you're building at Tinybird, can you give a bit of an overview about the platform and some of the goals of the business and the story behind how you decided to go about starting it and turning it into that business? What we're trying to do at Tinybird is
[00:03:57] Unknown:
help developers build data products at any scale, with huge amounts of data, without having to worry about scale. And, essentially, where does this come from? It comes from, initially, some things we were seeing at Carto. Like, at Carto, every year, you know, a customer would come the 1st year with, I don't know, datasets of a 100,000 records. The next year, those datasets would be a 1,000,000 records, and the following would be 10,000,000, and the following would be a 100,000,000. Yeah. So they're growing by an order of magnitude every year, easily. And Carto wasn't built to scale indefinitely for, you know, any use case. It was built with Postgres and PostGIS to do geospatial queries. And we found ourselves helping our customers a lot of the time with ETLs and explaining, you know, you need to transform your data in this way such that you only store in your database exactly the data that you will need for your particular use case, to ensure that we could then scale those use cases to however many visitors would come to the application that person was building, to the maps they were putting out there, and that it would scale without too much trouble. And all of those ETLs and all of those transformations, we found, were not very maintainable or very conducive to solving the business problems. You know what I mean? Like, we found ourselves going back to those ETLs over and over again when business requirements would change. Like, you have new dimensions you wanted to add or new filters and so on.
So that, in combination with wanting to be able to deal with more data, we started looking into ClickHouse as a technology, and we understood, wow, there is technology out there that would help us work with a whole different scale of data. And, actually, Carto in the end decided to go in a different direction, and they went more towards the data science aspect of it. So the geospatial data science and not so much building real time data products. The now founders of Tinybird each left to go to different companies at different times, but found the same problems in those different companies, like, you know, whether it was in the financial sector or the retail sector. And whenever there were huge amounts of data that needed to be joined with different dimensions, and then applications needed to be built on top of that, that was always a huge ordeal that involved cathedrals of infrastructure, different components in cloud providers, in order to build things that, from our developers' hearts, let's say, we thought, you know, we don't want to do all of this stuff. We just wanna focus as much as possible on solving the business problems, and we're gonna go fast. And if we can work with any amount of data the way we normally work with small amounts of data, we want to do the same with large amounts of data. So that's how we started thinking about this. And as we started working on this, Javi Santana, who was the CTO at Carto and the 1 that originally started thinking about this, you know, started building a prototype, and then we sort of incorporated the company and started working with some customers. We realized the incredible potential of real time, as in solving problems and making decisions in real time, is something that we have a huge belief is gonna be the norm in the future.
[00:07:22] Unknown:
And as you mentioned, the biggest barrier to actually being able to realize that real time decision making is the number of different processes that you have to do to be able to actually turn the data into something that's a usable format, beyond just how it's maybe stored or ingested. And in terms of the actual capabilities that are unlocked by being able to accelerate the time to value from having a piece of data, loading it into a system, and then being able to join it with other systems, what are some of the primary use cases that you and your customers are seeing from being able to actually have this capability?
[00:07:57] Unknown:
We see various use cases, but we see 2 over and over. 1 we call in-product analytics. Essentially, some of our customers are product companies or services companies that, you know, have a product or a service and serve a large number of users or customers. And they want to provide analytics back to them, for them to understand, you know, how they're taking advantage of that product and what is the benefit of using that product. And normally, that analytical part is not their core business, but it's an incredible value add to bring and something that customers demand more and more. So a lot of customers come to us and say, well, this is great because it enables me to just put my data in here, build the APIs, and I just have to worry about integrating it in my application. I don't have to worry about scale. I don't have to worry about any of that. So that's been something that we've done many, many times already. And the other 1 we see a lot, especially in larger organizations, like in corporates, is what we call operational analytics or operational intelligence, which is essentially real time business intelligence across your organization. For instance, we work with a large ecommerce retailer, 1 of the largest in the world. And these guys, they have, like, an application, not just a dashboard, like, a full blown internal application, a product that over 600 people within the company have open in their browser all day, every day, where they can understand how much they're selling across the world, what are their top products being sold, where they're running out of products in warehouses.
And they can understand that now in real time, thanks to Tinybird. And once they understood they could have that in real time, because they used to have it, but not in real time, that's triggered a lot of other operational derivative products, let's say, like being able to automate some decisions based on that data that's coming in and being able to expose some of those decisions to the final decision maker so that they can predict what's going to be the effect of changing 1 thing, like, you know, choosing a warehouse to serve some region versus the other 1 that's maybe running out of certain products, things like that. So those 2 use cases have been something that we encounter quite a bit. And little by little, we discover some others.
[00:10:24] Unknown:
As far as actually being able to process the data that is being fed to you, what are the main sources that people are sending to you for being able to build these analytical APIs? And what are some of the complexities that you're seeing in terms of being able to actually integrate with these sources and be able to pull from them at the frequency that your customers are looking for? In terms of how we started to ingest data, we started with CSVs because
[00:10:55] Unknown:
CSVs are the international currency for any database system in the world. So when we started, we thought, well, this makes sense because everybody uses CSV in 1 way or another, and we could always fall back to CSVs if we needed to in order to integrate with our customers. There's a lot of intelligence in our product about CSVs and about guessing the right types and about, you know, ensuring that we, for instance, parallelize imports when ingesting through a URL, and things like that. We've made it such that it's very easy to get up and running if you're using CSVs. But then CSVs have their own problems. Like, you know, data quality is still a huge problem with any customer.
Normally, we find that whenever, supposedly, there's gonna be integers in a column, you find all kinds of crap there, or, you know, line endings sometimes are different, so, you know, Windows line endings versus others, things like that that still trip us up sometimes, and we build fixes around that. We wanna make it such that it never fails, but our customers keep surprising us with new things in their CSVs. And then we are now moving towards pretty much any system that enables us to ingest data and that lends itself to real time. Like, Kafka is now a huge focus for us because a lot of the companies we talk to are using Kafka for capturing all types of data, a lot of, you know, web events and similar, and transactions and things like that. It's super simple for us to link to that and then start building APIs on top. That's a huge focus for us right now. But also, you know, we connect to other systems sort of opportunistically, like Snowflake or BigQuery or other data warehouses where people already have dimensional data, and they wanna bring it into Tinybird to do joins and then expose that as APIs.
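To make the ingestion workflow concrete, here is a minimal Python sketch of a URL-based CSV import into a data source. The host, endpoint path, parameter names, and token format are assumptions for illustration only; the exact contract comes from Tinybird's own API documentation.

```python
import requests

# Hypothetical sketch of importing a remote CSV into a Tinybird data source.
# The endpoint path, parameter names, and token format below are assumed for
# illustration; consult the actual Tinybird API docs for the real contract.
API_URL = "https://api.tinybird.co/v0/datasources"
TOKEN = "p.XXXX"  # placeholder import token

resp = requests.post(
    API_URL,
    params={
        "name": "events",                        # target data source name
        "mode": "append",                        # append new rows (assumed value)
        "url": "https://example.com/events.csv", # remote CSV to ingest
    },
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # details of the import job
```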
[00:12:50] Unknown:
As you mentioned, there are a number of systems that already exist for people to be able to actually track and report on data, and it's largely for internal purposes. But as they're starting to maybe try and build out something in house to expose that information for analytical APIs that are going to be consumed by other internal systems or other end users, what are some of the areas of complexity that are often overlooked or misunderstood as they start to go down that road that might ultimately lead them to want to use Tinybird rather than having to build it all in house?
[00:13:21] Unknown:
I think 1 of the things that have been key to some of our customers is the flexibility. When you have sort of complex data pipelines, as I touched on at the beginning, you often have to go back to those data pipelines, and those data pipelines will become almost like a product of their own that you have to maintain over time and evolve and so on. And depending on how complex and how good you are at it, and your team and so on, it might become an obstacle to solving more and more problems over time. So, you know, it's easy to understate how important it is to be able to move quickly and be able to attack new use cases and so on. With Tinybird, 1 of the things that is really appealing to some of our customers, in comparison to what they were doing before, is that flexibility. And the fact that once you have the raw data coming into Tinybird, any new use case is just 1 SQL query away. Just prepare your SQL query, expose it as an API endpoint, and you can start making queries. So that is the sort of thing that, when you start doing it on your own, if you're not using the right tools, you can be surprised at how inflexible your system might become over time. That's 1 thing. And then the other thing as well that we pay a lot of attention to is that these data products, as they grow, require some of the things that you also require in web applications or any other type of development, like tests and continuous integration.
And, you know, you want to work with a large team of people, and you're gonna want to be able to connect all of your configurations and queries and so on to your repository so that you can see what the changes have been over time and implement those tests and so on. And all of that is something that when you do a data product for the first time, especially if it's a large 1 and so on, you're not used to thinking like that in data products, but you soon start missing if you know what you're doing. You soon start feeling, oh, I need to add tests here. I need to have some sense that I'm not gonna break anything every time that I put it in production. And that's something that through the tools that we're building around Tinybird, we want also to help our customers and users with.
[00:15:40] Unknown:
Another area of complexity might be in terms of the format that that API takes. Do you focus primarily on just enabling REST APIs, or do you also look to provide GraphQL or, you know, maybe an RPC interface for being able to interact with this information? And, also, as far as monitoring and managing uptime and scalability, what are some of the additional systems that you've had to build out to be able to actually maintain the system, and that, you know, if somebody were to do it in house, they would end up having to build out, and that would distract them from the primary purpose that they're trying to deliver?
[00:16:14] Unknown:
The first part of it, in terms of the APIs, we right now return JSON or CSV. But, I mean, we have a set of APIs for you to use Tinybird as the user of Tinybird, let's say, to create endpoints, create data sources, replace data, all of those things. And then there are the APIs that you generate with Tinybird, and those are all for analytical purposes. So read only, let's say. And those are straightforward JSON and CSV that you can shape using SQL in the Tinybird pipes. You have pipes that work like a notebook kind of interface. You can chain queries and then decide, you know, what is the result you want to expose as an API. And that can be, right now, either JSON or CSV. In the future, we plan to add other things, like, potentially, as you say, an RPC interface, for instance, or WebSockets, things like that, so that you can do all kinds of things. Right now, it's just straightforward JSON and CSV APIs.
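As a concrete illustration of consuming one of these generated endpoints, here is a small Python sketch. The pipe name, URL shape, query parameters, and the `data` key in the JSON response are assumptions for this example; the real values come from your own workspace.

```python
import requests

# Sketch of calling a Tinybird-generated analytical endpoint that returns JSON.
# The pipe name, URL format, and parameters are hypothetical placeholders.
ENDPOINT = "https://api.tinybird.co/v0/pipes/top_products.json"
TOKEN = "p.XXXX"  # placeholder read token for this endpoint

resp = requests.get(
    ENDPOINT,
    params={"token": TOKEN, "date_from": "2021-01-01", "company_id": 42},
)
resp.raise_for_status()
for row in resp.json()["data"]:  # "data" key assumed from the JSON output format
    print(row)
```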
And then the other question about monitoring and observability, that's a huge part of our product. And, actually, interestingly, we weren't necessarily thinking about that when we started. But because we felt that pain constantly, like, with our customers, you know, what's going on with this customer? Why is this going slow? Why are these requests failing? Why is this happening? We added a lot of observability, like, a whole layer of observability on top of Tinybird that we can use internally, but we expose it to our customers as well. So our customers can know exactly, in real time, who's using their API endpoints, how many requests are coming, you know, how many ingestions, how many rows are you ingesting, at what rate, what speed, what duration. All of that information is available for you to query with SQL as part of Tinybird, just like you would query your own data. So you can build your own monitoring, and that's been hugely helpful for our customers. And it continues to be, like, my favorite feature as a person that works at Tinybird and needs to make decisions.
Because for the first time in my life, like, every question I have about what's going on with the platform, I can answer it with my own product, and that's a huge boost for us. Like, if we're thinking about changing our pricing and we want to understand, you know, how many requests or what the amount of data is, all of those things, we can just answer those questions using Tinybird, which has been a huge boost for us in many ways.
[00:18:38] Unknown:
Yeah. It's definitely validating when you wanna use the thing that you're building and not just sell it to other people.
[00:18:43] Unknown:
Exactly. And, I mean, this is nothing new, I'm not discovering anything new, but with that kind of approach, it's very frustrating when things are not working as they used to, and that drives a lot of decisions. Like, even if the customer's not telling you, you know, there are some things that are just plain wrong there and they need to be changed. And that's been a huge source of feedback for us.
[00:19:03] Unknown:
Digging deeper into the platform itself, can you talk about how it's architected and some of the ways that it has changed or evolved in terms of the goals and implementations since you first began working on it? We try to keep it as simple as possible. And
[00:19:17] Unknown:
for that, we try to keep the minimum set of dependencies we can. And at a high level, for instance, we try not to have any lock in with cloud providers, but we use Google at the moment. We just use, essentially, compute, so we can run Tinybird pretty much in AWS and in Azure as well. But we don't wanna lock ourselves to any cloud provider right now because we also see a huge future for ClickHouse, and we'll talk about ClickHouse later, I'm sure. It takes advantage of every CPU core you throw at it, and the closer you are to the metal, the faster you can make it go. It's something we haven't invested a lot of time in, but we see a lot of future in using bare metal directly, you know, for huge use cases. But in general, that's something we keep in mind. We try to keep as few dependencies as possible. And I was saying, at a high level, that means not locking in with cloud providers, but at a lower level, with libraries as well. Like, before we decide to use a specific library in our code, we make sure that we understand it really, really well, such that we can live with it and change it and do whatever we need to do if need be. So we don't want black boxes in that sense. And the same goes for ClickHouse itself. Like, we spend a lot of time, and we're hiring ClickHouse experts, to make sure that, you know, we understand what's going on under the hood, and we can alter it and make improvements as we go.
And then in general, in terms of components, you know, we have some load balancing just in front of our app servers, which are written in Python with Tornado, and we have some background processes as well, apart from the app servers. And then we use Redis for metadata storage and then ClickHouse for all of the analytical queries and so on. So it's not a hugely complex setup, and that's largely how it looks.
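To picture how those pieces fit together, here is a deliberately simplified sketch of that kind of stack: a Tornado handler that looks up per-endpoint metadata in Redis and runs the SQL against ClickHouse over its HTTP interface. All names, ports, and the metadata layout are illustrative assumptions, not Tinybird's actual implementation.

```python
# Highly simplified sketch of the stack described above: a Tornado app server
# that reads per-endpoint metadata from Redis and runs the analytical query
# against ClickHouse over its HTTP interface. Names and schemas are assumed.
import json
import redis
import requests
import tornado.ioloop
import tornado.web

CLICKHOUSE_URL = "http://localhost:8123"   # assumed local ClickHouse HTTP port
meta = redis.Redis(host="localhost", port=6379)


class PipeHandler(tornado.web.RequestHandler):
    def get(self, pipe_name: str):
        # Endpoint metadata (the SQL behind the API) lives in Redis.
        raw = meta.get(f"pipe:{pipe_name}")
        if raw is None:
            raise tornado.web.HTTPError(404)
        sql = json.loads(raw)["sql"]
        # Run the query on ClickHouse and return the JSON result.
        ch = requests.get(CLICKHOUSE_URL, params={"query": sql + " FORMAT JSON"})
        self.set_header("Content-Type", "application/json")
        self.write(ch.content)


app = tornado.web.Application([(r"/v0/pipes/([a-z_]+)\.json", PipeHandler)])

if __name__ == "__main__":
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
```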
[00:21:16] Unknown:
And I know that a number of people are actually using ClickHouse for some of the metrics type data in their systems as well for being able to collect logs and manage the infrastructure time series data. Are you able to actually use ClickHouse for your monitoring as well as the product?
[00:21:31] Unknown:
Yes. Basically, we collect, you know, all of the stats. For instance, we're working on a public web page that's gonna show all of our traffic in real time through Tinybird API endpoints. So you can see, you know, as traffic is coming in, you're gonna be able to see it at scale, because we drop all of that back into our ClickHouse server that has Tinybird on top, and we can quickly analyze it and so on. So, yeah, we use it for everything.
[00:22:00] Unknown:
And you mentioned that it's sort of the core building block, and that it's something that you found when you were working at Carto. But as you were revisiting this idea of I wanna be able to deliver real time APIs as a service to other people, what were your overall criteria in terms of the decision making? Did you think about anything else besides ClickHouse, or was it just ClickHouse all the time and you you knew that going in? And just curious sort of what the decision process was as you were deciding to stake your business on this piece of technology.
[00:22:31] Unknown:
We fell in love with ClickHouse first, and then we started looking at alternatives to see, is this the best out there? But there's a lot of good reasons for us to use ClickHouse. First 1, if we're thinking about, you know, enabling customers to scale to whatever they need to scale, ClickHouse is the absolute best out there. It's super scalable, both horizontally and vertically. Another huge aspect of it is that it uses SQL, and SQL is, like, you know, the lingua franca, let's say, for database systems, and developers all around the world know SQL, so they don't have to learn anything new to use Tinybird. Then, you know, it's really, really good. The queries have super low latency, which, if you're trying to build a real time product, you need. And that's the problem with trying to do that with systems like BigQuery or Snowflake or Redshift, which are great at running queries over huge amounts of data, but you always have, you know, some latency that makes it really hard to scale, let's say. If you're gonna hit that system with hundreds of queries per second, first, you're gonna have to throw a lot of money to solve that problem so that you can scale that up in terms of servers and so on. And, you know, solving real time with endless money is possible, you know, but the key is to solve it at a budget. And that makes sense. So ClickHouse allows you to have super low latency and also a great ingestion rate. Like, it can ingest incredibly fast, and that's what we consider real time. You need both. You need to be able to ingest the data really fast, and you need to be able to query really fast. Otherwise, you know, if you're adding a lot of delay on 1 side or the other, you're basically moving away from real time, and that was a key factor for us. And then, a thing I mentioned a bit earlier, it has state of the art algorithms in the sense that it's built to take advantage of every single CPU core you throw at it, and that's why it is so able to scale. And, also, super important from the business point of view, it's a super well maintained open source project. And it has a thriving community that's adding more and more people to it every day. And for us, that was also a key factor.
[00:24:53] Unknown:
And as you have been using ClickHouse in earnest for a while now, and you've been testing it at various levels of scale. What are some of the sharp edges that you've run into while running it and some of the other custom tools or patches that you've made to it or systems that you've built around it to be able to help deal with some of those challenges?
[00:25:13] Unknown:
In a general sense, something to keep in mind is that there's a relatively large code base. ClickHouse is not huge, but it's large enough that, you know, there's a lot of corners and areas of code to explore and understand. And it's really well written and it's easy to follow, but sometimes, for certain things, you really need to understand it in detail to know what's going on and to be able to troubleshoot something that might be happening. We always say that ClickHouse is like a Formula 1. You know, if you know what you're doing and you have a team of people that understand it really well, like, you know, in the case of Formula 1, it would be the mechanics and the engineers and so on, and you have been driving for a long time, you know, you can make it run at 300 kilometers per hour. But what we're trying to do is make it so that anyone can run it at that speed. Any driver, let's say, any developer. So that's something to keep in mind. And there's been things that, you know, we've learned in the process. Like, you know, there's some data management oddities, I would say, that, if you think about it, make all the sense in the world, but you don't expect it to work like that when you use ClickHouse for the first time. For instance, any deletion you want to do, you need a lot of disk space to do that deletion. Because in order to delete data, let's say you have a partition and you just want to delete a month of data and you're partitioning by year or something like that, what ClickHouse will do is it will copy the partition without the bit that you want to delete.
And then once that's finished, it will do the swap and get rid of the old partition, more or less. So we found some customers wanted to do some massive deletion and they couldn't. Or, for instance, they had a TTL, like, time to live, in their data source table, let's say, such that certain data needed to be deleted after a day or something like that, like 24 hours. But they were ingesting such huge amounts of data that it was, you know, terabytes of data. And when it was going to get deleted, we couldn't, because there wasn't enough disk. We hadn't realized that it was gonna be so big and the TTL would not be able to work. And that caused us to scramble, extend the disk, you know, those kinds of things. So those are things that we didn't expect. Even if you understood the problem well before designing the system, it's very easy to overlook, you know, very easy to have a customer that you're basically allowing to ingest any amount of data and realize that that might be a problem. And then things like, you know, for instance, we found a bug also when you're doing replication. ClickHouse is really good at replicating data and keeping all replicas up to date. But whenever you do a command that requires a synchronous ACK from all of the replicas, like, just so that you're sure that all of the replicas are up to date with a certain command, like an optimize or something like that, there was a bug where it would wait for all the replicas even if a replica had died. You know, if that replica had crashed, it would wait for all of the replicas to confirm that the command was finished and basically wait forever until someone would realize and restart that replica exactly in the same way. So things like that we found along the way, and when we found them, we provided a patch or something. And, I mean, the ClickHouse core team is amazing at, you know, accepting those patches and putting them into the master branch and so on. And, yeah, there's other problems. Like, you know, the loading of the tables. You need to bear in mind that it's alphabetical, and so if there are any dependencies between tables, like a Join table or something like that, and you try to load that table before the other 1, then it will crash, you know. So we had to build something to ensure that the tables are loaded in the right order and, you know, that we don't have those kinds of problems. So we found a lot of things over time that, you know, you don't think about when you're building a system that needs to serve potentially thousands of customers, and that's something that we've learned a lot about.
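As an illustration of the deletion trade-off described above, here is a hedged sketch using the clickhouse-driver package (any ClickHouse client would do). Dropping a whole partition is cheap, while an ALTER ... DELETE mutation rewrites the affected parts and therefore needs spare disk on the order of the data being rewritten, so it is worth checking headroom first. The table, partition key, and filter are hypothetical.

```python
# Sketch of the two deletion paths in ClickHouse described above, using the
# clickhouse-driver client. The "events" table and its partitioning scheme
# are hypothetical placeholders for illustration.
from clickhouse_driver import Client

ch = Client(host="localhost")

# Cheap path: the data to remove lines up with a partition boundary.
ch.execute("ALTER TABLE events DROP PARTITION '202101'")

# Expensive path: a mutation that rewrites parts. Check disk headroom first.
free_bytes = ch.execute(
    "SELECT free_space FROM system.disks WHERE name = 'default'"
)[0][0]
table_bytes = ch.execute(
    "SELECT sum(bytes_on_disk) FROM system.parts "
    "WHERE table = 'events' AND active"
)[0][0] or 0

if free_bytes > table_bytes:  # crude safety margin; tune for real workloads
    ch.execute("ALTER TABLE events DELETE WHERE user_id = 42")
else:
    raise RuntimeError("Not enough free disk to run the delete mutation safely")
```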
[00:29:22] Unknown:
In terms of the actual life cycle management, you were mentioning that for deletions, you have to have roughly double the disk space. And I'm wondering if you can talk a bit more about, in terms of your platform and the user experience, do you allow people to just maintain data in perpetuity? Do you have sort of an enforced life cycle policy? What sort of tuning is available for people to be able to determine at what cadence data should expire, and what happens to it after it's expired?
[00:29:52] Unknown:
So all the data that you have in your Tinybird account, we consider it hot data, as in subject to be queried and needing to be available in a low latency fashion, let's say, at any time. So we don't have a concept of sort of cold data or hot data. Everything is hot data and subject to be queried at any time. And in that sense, right now, our pricing revolves around storage and concurrency. At least right now, you need to think a little bit about what you wanna do. And for large use cases, we help our customers design their system and recommend, okay, this is what you're gonna need and so on. But right now, customers have full control of basically what TTL they want to establish. You know, we help them work with that. And then we are very alert about potential problems, especially because we've learned sometimes, even if it looks like they have plenty of space, you know, suddenly a few customers can be doing different things at the same time and then cause a problem. So we keep a very close eye with alerts and our own observability layer to make sure that, if some customer needs to be warned, hey, you need to be careful here because you're gonna run out of space or you need to upgrade or you need to change your policies, then we do it on a sort of a reactive basis, let's say. We are moving towards making it completely customer driven, as in, I want more speed.
That sounds very movie like, but because we know how the query is built, and we know how many CPU cores that particular query is using to return. And we can tell you, hey. If you want this query to go faster, we can do that. You know, click here to upgrade or click here to get more speed. And we can tell you in advance how fast that query potentially could go. So those are things we're very interested to explore. And the same for disk space, you know, that you can see when you might be running into problems and that you can extend it yourself without any help from us.
[00:31:58] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
And you mentioned that you have the query analyzer to be able to understand. This is the potential speed that you could get trying to answer this question. And as people are designing their APIs and the queries that they're trying to, you know, drive the API from, what are some of the performance issues that they might run into or some of the challenges that they might have in terms of being able to actually formulate the query in a way that makes sense for them being able to deliver in an API?
[00:33:00] Unknown:
We try to be fast by default, as in we want the experience to be really, really good from the beginning, such that when you upload data or you connect it to a Kafka topic or something like that, you can start working with the data and everything is fast and so on. But, obviously, we're not putting limits in place, at least not now. We'll see in the future if there are some things we need to block. But, you know, anyone can write a slow query if they set their mind to it. You know? So that's something that, you know, you can't help. So there's always gonna be cases where someone comes in and writes a slow query, especially when you have huge amounts of data. But we try to make a lot of decisions in advance for the customer, for the user, that you can always then change. Like, for instance, how the data is ordered on disk is really important for performance.
So based on the data that we see coming in, we make those decisions upfront so that, you know, we try to have smart defaults for our customers, for users. And then the same goes for a number of things that, if you've never worked with analytical databases, you probably don't know or you're not used to thinking about in that way, with columnar databases in particular. Like, for instance, when you're working with massive amounts of data, you wanna make sure that you think about how you join the data intelligently. And the first typical thing is, hey, you're gonna join? Well, filter first and then join, such that you reduce the amount of data that you then need to join. So those kinds of things, we are starting to, little by little, add functionality in the product so that when we detect that type of pattern, we're gonna say, hey, you should change your query like this, it'll go faster, you know. And those kinds of things, because we see a lot of different use cases across our customers and the types of queries that they can do, you know, we can build functionality to help our customers, you know, write better SQL, if you know what I mean. And even at times just not say anything and change the query on the fly before it comes back. So if you're used to writing Postgres SQL or MySQL SQL, you know, it'll work and it'll be fast even if it's not ideal. We can't always do that, but there are certain cases where it's just a question of understanding how the query is structured and doing the change in the background if need be. Those are some of the things that are worth learning. And, by the way, we did a real time analytics course when we were getting started, to get leads and so on. And a lot of people signed up, and it was all about those kinds of things: what types of things you need to bear in mind when you're writing queries over huge amounts of data.
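Here is a small sketch of the "filter first, then join" pattern mentioned above, against a hypothetical events/companies schema. Both queries return the same result, but the second reduces and aggregates the large table before the join, which is the kind of rewrite a columnar store rewards.

```python
# Sketch of the "filter first, then join" rewrite on a hypothetical schema.
from clickhouse_driver import Client

ch = Client(host="localhost")

slow_sql = """
    SELECT c.name, count() AS hits
    FROM events AS e
    JOIN companies AS c ON c.id = e.company_id
    WHERE e.date >= '2021-01-01'
    GROUP BY c.name
"""

fast_sql = """
    SELECT c.name, hits
    FROM (
        SELECT company_id, count() AS hits
        FROM events
        WHERE date >= '2021-01-01'     -- filter and aggregate first
        GROUP BY company_id
    ) AS e
    JOIN companies AS c ON c.id = e.company_id
"""

print(ch.execute(fast_sql))
```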
[00:35:42] Unknown:
In terms of the actual end user experience, there are a couple of things that are probably worth digging into. 1 being the actual data loading because as you were mentioning, the way that you structure your queries is influenced by how you load your data and the structure there. And then the other aspect of that is schema evolution where the data source changes, and then you need to be able to reflect either a new column or a changed data type in the ClickHouse cluster. And so I'm wondering if you can just talk through the overall workflow of somebody who's getting started with Tinybird and building an API and how those data loading considerations factor in. You've touched on something that we always have a feeling that if we get right, we're gonna take on the world because
[00:36:24] Unknown:
it's something that's really challenging to do when you have a lot of data. This is going back to the Formula 1 thing, you know. We want that to be as easy as possible without really necessarily understanding what's going on. And things like adding columns, you know, just adding a column is not that problematic, but, you know, changing the schema or changing the order of the data or doing those kinds of things, some will require you to recreate those tables and so on, and that is a pain. So with Tinybird, we have not just the UI where you can use the browser and write your queries and so on. We also have a command line interface. You can think of it as a type of git-like command line interface that allows you to pull all your schemas and all your queries into text formats that you can work on in your IDE or Visual Studio or your text editor or whatever, and then upload that to GitHub and work collaboratively and then push back into Tinybird. And we started with the CLI to build versioning, both for data sources and pipes or API endpoints, such that if you have a data source that, you know, maybe already has huge amounts of data, and you already have an ingestion coming in, or several different points of ingestion for the same data source, and you want to add columns or you wanna change something or the order or whatever, you can create a new version of that data source.
And you can do it in such a way that all the data from the initial data source will pass on to the new version. And whenever you're ready, you can just point the pipes to the new version, and then your APIs will not notice anything. And you can continue either ingesting in the old data source, and the data will sort of flow through to the new 1, or start ingesting to the new 1. So that's how our versioning system works right now. But we've realized that sometimes, especially when you're starting and you still don't know and you're still playing with the data and so on, there's a number of things you just want to do, and you want that to be as simple as possible. So we're adding things like adding a column straight away, without having to version or anything like that. And in a way that won't break your existing ingestion, because that's something you have to bear in mind. If you're ingesting a CSV that has 6 columns and suddenly you add 2 new columns or you remove a column, you know, what happens with the ingestion that's coming in? So we're doing it in such a way that it won't upset any ingestion process or anything like that, and you can evolve that much quicker. And then if you want to have versions, you can also do that. So that's how it works right now. It's 1 of the things that nobody realizes at the beginning.
And then when you start having a big project and so on, it becomes a thing that you need to master.
[00:39:16] Unknown:
Another interesting aspect of the platform you're building is how you manage multi tenancy where, for instance, you mentioned a customer who had terabytes of data that they wanted to drop a month's worth of. And so now, all of a sudden, you're running into disk space issues. And, obviously, you don't want that to impact other people who maybe have a smaller volume of data, where they're working with gigabytes per month, you know, and they just wanna have a simple API. And so I'm curious how you're managing that multi tenancy. Do you have dedicated clusters for each customer? Do you have some customers who are on a larger plan who have a dedicated cluster, and then others are on a shared cluster where you have quotas established for usage of the ClickHouse cluster? And just how do you sort of implement all of that and build it in a way that you're actually able to make it maintainable without tearing your hair out?
[00:40:01] Unknown:
That's a good question. So we have 2 kinds of accounts. We have sort of shared resources kinds of accounts, which are purely multi tenant, and we also have dedicated resources. So in the shared approach, basically, that means everything is shared. Like, you share load balancing, you share web applications, you share, you know, the infrastructure is the same. And then in terms of ClickHouse, the shared infrastructure is a huge cluster with multiple databases and 1 database per customer, let's say. Each cluster can have multiple instances, and those instances are within the same network. So, you know, we can scale that up as needed. And then, basically, that's how the shared approach works. And in the case of ClickHouse, it's slightly different from other databases. Like, a cluster can have multiple ClickHouse instances, and then each instance can have multiple databases.
And then those are just essentially collections of tables with their own users and security and so on. So that's how we do sort of the shared approach. And then the dedicated approach can be 1 of 2. 1 is fully dedicated, and that's your own Tinybird, basically. It's just everything. Nothing shared at all. We do that for large customers that basically want nothing shared whatsoever, and that's fine with us. And we also, in those cases, explore with our customers whether it's our cloud or their cloud. Sometimes they prefer to pick up the bill, let's say, even if we operate it for them. That's something we're very open to. And then the other dedicated approach is a mixed 1, which is you have your own dedicated database. That's where you're going to see the performance improvements and where you don't want any fighting amongst resources.
And then they share some of the common infrastructure, like, you know, load balancers and maybe some web applications and so on. But, really, when we talk about dedicated resources and high availability and so on, the key of that is in the database, in the ClickHouse instances, and we can very easily set that up for any customer.
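A rough sketch of the shared-cluster layout described above: one ClickHouse database per customer, each with its own user and grants. The naming scheme and password handling are placeholders for illustration, not Tinybird's actual provisioning code.

```python
# Illustrative per-tenant provisioning on a shared ClickHouse cluster:
# one database per customer, a dedicated user, and grants scoped to it.
from clickhouse_driver import Client

admin = Client(host="localhost", user="default")

def provision_tenant(tenant: str, password: str) -> None:
    db = f"t_{tenant}"
    admin.execute(f"CREATE DATABASE IF NOT EXISTS {db}")
    admin.execute(
        f"CREATE USER IF NOT EXISTS {tenant} IDENTIFIED BY '{password}'"
    )
    # The tenant only ever sees its own database.
    admin.execute(f"GRANT SELECT, INSERT ON {db}.* TO {tenant}")

provision_tenant("acme", "change-me")
```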
[00:42:13] Unknown:
And as you have been building out Tinybird and working with your customers and using it for being able to, you know, build APIs for monitoring Tinybird to build Tinybird, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:42:27] Unknown:
I mean, recently a customer worked on a use case that we're in love with. A customer of ours is a platform as a service type of business, and they have a huge CDN, edge servers, let's say, around the world. And they've built with Tinybird their own WAF, like their own web application firewall. Essentially, how that works is that all of the logs from all of those edge servers are being sent to a Kafka instance, from which we ingest. And, you know, from the moment that a request comes into 1 of those edge servers to when it's available in Tinybird is maybe 2, 3 seconds. It's really, really fast, if you consider that's sort of an average across the world. And then they built some API endpoints such that each edge server, every 5 or 10 seconds, queries the incoming data. So it sends a request to Tinybird to see if there are any specific IPs that are generating a huge peak of traffic.
And if that's the case, they will cut the traffic from that IP to avoid a denial of service attack kind of thing. So that's been something we were not thinking about at all, and the customer of ours first built a different use case and then thought, actually, would this scale to do this? And they managed to do it super well, and they managed to do it really, really quickly. And it forced us to be creative about certain things we hadn't thought about in terms of how we'd ingest the data faster and so on. Because think about it, when there's a denial of service attack, you know, we're ingesting maybe, you know, 40,000 records per second on average. But when there's a denial of service attack, maybe that goes up to 3 times that or more. And even if it's just 1 server, it'll be hitting us with a lot of requests, and we need to make sure we can keep up, because otherwise it will defeat the purpose of what they're trying to do. So it's forced us to improve some areas of the product, like ingestion from Kafka, to give it even higher bandwidth than we had at the beginning.
But we absolutely love that use case.
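To make that firewall loop concrete, here is a hypothetical sketch of what each edge server might run every few seconds. The endpoint name, parameters, and response fields are assumptions; the real pipe would be a SQL query over the ingested access logs, grouped by IP over a short window with a threshold.

```python
import requests

# Hypothetical polling loop for the WAF use case described above. The
# endpoint name, URL shape, parameters, and response fields are placeholders.
ENDPOINT = "https://api.tinybird.co/v0/pipes/suspicious_ips.json"
TOKEN = "p.XXXX"  # placeholder read token

def fetch_ips_to_block(window_seconds: int = 10, threshold: int = 5000) -> set:
    resp = requests.get(
        ENDPOINT,
        params={"token": TOKEN, "window": window_seconds, "threshold": threshold},
    )
    resp.raise_for_status()
    return {row["ip"] for row in resp.json()["data"]}

# Every 5-10 seconds each edge server refreshes its local blocklist.
blocked = fetch_ips_to_block()
print(f"blocking {len(blocked)} IPs")
```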
[00:44:33] Unknown:
In terms of your own experience of building out Tinybird and growing the business and helping your customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:44:43] Unknown:
From a business point of view and in general, something we believe very strongly in and we've added to our company principles is something we say, which is speed wins, because speed sort of applies in every aspect. Your speed is the definitive differentiator, however you look at it. You know, at the business level, you make decisions faster, then things move faster, you can do more things, you can resolve more problems. And a lot of the decisions, maybe 90% of the decisions, are reversible. So many times it's much better to just make a decision than to hang around and think and ask and bother a lot of people, you know, because a lot of the times, those decisions just don't matter. Just go ahead and do it quickly and then let's see the result rather than, you know, getting paralyzed by the analysis. And then the same goes for technology and infrastructure. If your product is faster, you'll need less infrastructure to serve the same use cases, which means less cost and a better business.
But the same goes for the user experience. The faster the results, the happier the customers will be, you know, the more they'll talk about Tinybird. And so it's something that we already thought, but we've seen it in such a clear way that we've made it a company principle. And, you know, you can see it in our chats often. Someone will ask, hey, what do you think? And the answer is oftentimes speed wins, you know, which means you decide. It's not that it doesn't really matter, you know, it's probably better to just quickly decide and move forward than to hang around with this and so on. That's been 1. And then another 1, from a business point of view as well, is, you know, we have grown convinced about real time being something that will be the norm in the future, and that's a huge thing for us, because I think a lot of companies live with batch processing and its delays as a necessary evil, like, that's the way it is. It's like background noise, something that you don't necessarily notice until someone switches that background noise off. And when you realize some of the things you can do in real time, you start thinking, wow, what else can I do in real time? You know? And that opens up a lot of new ways of thinking about your business and opportunities about how to do things, and that's been a huge discovery for us. Yeah. Those are a couple of them. On a more pragmatic basis, something we've learned is that data is always dirty.
So whenever you think that, you know, yeah, just start ingesting and blah blah blah, and, you know, it's very easy, there's always problems with the data. You know, we assume it's never as easy as it looks with data, especially when you have huge amounts of it. But, yeah, those are some of the lessons that we've learned over time. And for people who want to be able to deliver analytical APIs
[00:47:40] Unknown:
and do it at high speed, what are some of the cases where Tinybird is the wrong choice and they might be better off either building something internally or using some other off the shelf product or system?
[00:47:50] Unknown:
Obviously, this is purely analytical, so anything resembling transactional use cases, this is not the right product for. You know, either just using Postgres or MySQL or, you know, any transactional database, or new services that are coming out now that are databases as a service but geared towards transactional use cases, would be a better choice. And then for use cases like point queries, you know, it's not that we would necessarily be the wrong choice, but we are not particularly better than other systems. You know? I mean, we do have some of those and it's really fast, but not necessarily faster than any other system. So that, and key values in general, it's not ideal. Anything that's time series and so on is great, and where you can throw huge amounts of data and so on. And we talked about in-product analytics at the beginning. Another reason why Tinybird is great for that is that you only do queries by company ID, let's say, or customer ID, which enables you to limit the amount of data that you're querying just by default. So even if you have huge amounts of data in total, you're always gonna be querying for a particular company ID. You know, those are the kinds of things that really make sense. But, yeah, transactional use cases or point queries, you know, those are not the ideal use cases for Tinybird.
[00:49:16] Unknown:
And as you continue to build out the product and the business, what are some of the things that you have planned for the near to medium term? 1 of the things that happened with Tinybird is that our first customer was a huge customer,
[00:49:28] Unknown:
and we were, like, we're just 5 or 6 people, and we were thrown into needing to deliver that use case. And that forced us to focus on scalability and reliability and performance, but not necessarily on the user experience, because we were basically running the show and sort of building out the product and the solution, let's say. But the future for us is in enabling developers. So we want to make this so easy to use for any developer or data engineer out there that they just don't think about anything else. So 1 huge sort of change in terms of focus has been that over the last few months. So we're investing a lot in making this super easy to use, to connect to any data source, to start ingesting, to start building queries, to build APIs.
So that's something that we're gonna do more and more. And that also means adding the toolset that developers are comfortable using. That's why we've added a command line interface. That's why, you know, we make it really easy to integrate with GitHub, all of those things. Apart from that, we are super focused on high frequency ingestion, high concurrency types of use cases. That's where we see the market going more and more, and that's where we're really good and where we can scale to handle pretty wild use cases. And that's where our main focus is gonna be for the foreseeable future. And then in terms of the business, the company was born in Spain, and we are now starting to have customers all over the world, and we're gonna start, you know, expanding into those territories, like the US and the UK and the rest of Europe, over the next few months. So we're gonna start making a lot more noise.
[00:51:16] Unknown:
Well, for anybody who wants to follow along with the work that you're doing and keep in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. You know, 1 of the areas where we see
[00:51:33] Unknown:
a lot of opportunity is in the sort of data ops aspect of building applications and data products. If you think about web development over the last 15 years, a lot of people have learned and take for granted things like continuous integration and, you know, testing, test driven development, things like that. And that's something that we don't see as much with our customers. Like, they don't have that kind of approach when it comes to data products. And I think there's a huge opportunity there and something that hopefully we can help drive from Tinybird, to sort of establish the right way to build data products. You know? What are the best practices and what are the right types of tools to do that? And that for us is something that we miss when we build data products, and that has sort of guided us towards adding certain functionality to the product that we didn't initially expect we would need, from the point of view of, you know, being able to automate certain things, like testing, doing checks automatically when you're pushing new endpoints to the system, enabling, you know, continuous integration in an easy way and integration with a source code repository, all of those things.
I think it's a big gap and something that teams need to build by themselves a lot, and it takes time and learning and so on. And that's something that when we've had to run big data products, it's the first thing we think about because we know we're gonna run into that soon enough. So we start with that, and we wanna sort of to see how we can help developers learn how to do that, not just with the web development products, but with their data products as well.
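As one possible shape for the kind of endpoint testing described here, a minimal pytest-style check that an analytical API keeps the contract an application depends on. The URL, token variable, and expected columns are assumptions about a hypothetical project, not a prescribed Tinybird testing framework.

```python
import os
import requests

# Hypothetical CI test for a data endpoint: assert it stays up and keeps
# returning the columns the application depends on. URL, token, and column
# names are placeholders for illustration.
ENDPOINT = os.environ.get(
    "TOP_PRODUCTS_URL",
    "https://api.tinybird.co/v0/pipes/top_products.json",
)
TOKEN = os.environ["TINYBIRD_READ_TOKEN"]


def test_top_products_contract():
    resp = requests.get(ENDPOINT, params={"token": TOKEN, "limit": 10})
    assert resp.status_code == 200
    rows = resp.json()["data"]
    assert len(rows) <= 10
    # The contract the application depends on: these columns always exist.
    for row in rows:
        assert {"product_id", "units_sold"} <= row.keys()
```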
[00:53:16] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Tinybird. Certainly a very interesting project and 1 that suits a big need in the overall ecosystem. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thank you very much for having me. It's been great. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Jorge Sancha: Introduction and Background
Overview of Tinybird and Its Goals
Use Cases and Capabilities of Tinybird
Data Sources and Integration Challenges
Platform Architecture and Evolution
Challenges and Customizations with ClickHouse
User Experience and Query Optimization
Managing Multi-Tenancy and Customer Resources
Interesting Use Cases and Customer Stories
Lessons Learned and Company Principles
Future Plans and Focus Areas
Biggest Gaps in Data Management Tooling
Closing Remarks and Contact Information