Summary
Building an API for real-time data is a challenging project. Making it robust, scalable, and fast is a full-time job. The team at Tinybird wants to make it easy to turn a continuous stream of data into a production-ready API or data product. In this episode, CEO Jorge Sancha explains how they have architected their system to handle high data throughput and fast response times, and why they have invested heavily in ClickHouse as the core of their platform. This is a great conversation about the challenges of building a maintainable business from a technical and product perspective.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Ascend.io — recognized as a 2021 Gartner Cool Vendor in Enterprise AI Operationalization and Engineering—empowers data teams to build, scale, and operate declarative data pipelines with 95% less code and zero maintenance. Connect to any data source using Ascend’s new flex code data connectors, rapidly iterate on transformations and send data to any destination in a fraction of the time it traditionally takes—just ask companies like Harry’s, HNI, and Mayvenn. Sound exciting? Come join the team! We’re hiring data engineers, so head on over to dataengineeringpodcast.com/ascend and check out our careers page to learn more.
- Your host is Tobias Macey and today I’m interviewing Jorge Sancha about Tinybird, a platform to easily build analytical APIs for real-time data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Tinybird and the story behind it?
- What are some of the types of use cases that your customers are focused on?
- What are the areas of complexity that come up when building analytical APIs that are often overlooked when first designing a system to operate on and expose real-time data?
- What are the supporting systems that are necessary and useful for operating this kind of system which contribute to the overall time and engineering cost beyond the baseline functionality?
- How is the Tinybird platform architected?
- How have the goals and implementation of Tinybird changed or evolved since you first began building it?
- What was your criteria for selecting the core building block of your platform, and how did that lead to your choice to build on top of ClickHouse?
- What are some of the sharp edges that you have run into while operating ClickHouse?
- What are some of the custom tools or systems that you have built to help deal with them?
- What are some of the performance challenges that an API built with Tinybird might run into?
- What are the considerations that users should be aware of to avoid introducing performance issues?
- How do you handle multi-tenancy in your platform? (e.g. separate clusters, in-database quotas, etc.)
- For users of Tinybird, can you talk through the workflow of getting it integrated into their platform and designing an API from their data?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Tinybird used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building and growing Tinybird?
- When is Tinybird the wrong choice?
- What do you have planned for the future of the product and business?
Contact Info
- @jorgesancha on Twitter
- jorgesancha on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Ascend.io, recognized as a 2021 Gartner Cool Vendor in Enterprise AI Operationalization and Engineering, empowers data teams to build, scale, and operate declarative data pipelines with 95% less code and zero maintenance. Connect to any data source using Ascend's new flex code data connectors. Rapidly iterate on transformations and send data to any destination in a fraction of the time it traditionally takes. Just ask companies like Harry's, HNI, and Mayvenn. Sound exciting? You can join the team. They're hiring data engineers, so head on over to dataengineeringpodcast.com/ascend and check out their careers page to learn more. Your host is Tobias Macey. And today, I'm interviewing Jorge Sancha about Tinybird, a platform to easily build analytical APIs for real time data. So, Jorge, can you start by introducing yourself? Hello, everyone, and thanks for having me, Tobias. My name is Jorge Sancha. I am the CEO of Tinybird. My background is in product and engineering, and I've been working in
[00:01:57] Unknown:
startups and data intensive products for the better part of the last 20 years. And super excited to tell you a little bit more about Tinybird today.
[00:02:07] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:10] Unknown:
Yes. Although my background has always been around web applications and so on, data has always been a factor, along with scalability and making sure that applications would be able to serve all of our customers in a fast way and so on. The analytical side of it, and the sort of data as a product, let's say, really was at Carto. Carto is a company that started out in Spain, and now it's all over the world. They do location intelligence, and that was the first company where I got involved with data, let's say, as key elements and analytics. And we were building a platform where customers were bringing their own data, and we were building the tools to do things with it. That was really where I understood, you know, how complex it can be to scale applications and infrastructure up to, you know, thousands of users or thousands of customers, potentially hundreds of thousands of users, and real time analytics. And that was a huge learning experience for me. I joined Carto as a VP of engineering to help them expand the teams, to help them with the delivery, and to help them with processes and all of those great things. And I learned a lot in the process, and I found the cofounders with whom I then started Tinybird. So I would say that, although I've always been involved with data, that was where I understood what analytics was all about.
[00:03:41] Unknown:
And in terms of what you're building at Tinybird, can you give a bit of an overview about the platform and some of the goals of the business and the story behind how you decided to go about starting it and turning it into that business? What we're trying to do at Tinybird is
[00:03:57] Unknown:
help developers build data products at any scale, with huge amounts of data, without having to worry about scale. And, essentially, where does this come from? It comes from, initially, some things we were seeing at Carto. Like, at Carto, every year, you know, a customer would come the 1st year with, I don't know, datasets of a 100,000 records. The next year, those datasets would be a 1,000,000 records, and the following would be 10,000,000, and the following would be a 100,000,000. Yeah. So they're growing by an order of magnitude every year, easily. And Carto wasn't built to scale indefinitely for, you know, any use case. It was built with Postgres and PostGIS to do geospatial queries. And we found ourselves helping our customers a lot of the time with ETLs and explaining, you know, you need to transform your data in this way such that you only store in your database exactly the data that you will need for your particular use case, to ensure that we could then scale those use cases to however many visitors would come to the application that person was building, to the maps they were putting out there, and that it would scale without too much trouble. And all of those ETLs and all of those transformations, we found, were not very maintainable or very conducive to solving the business problems. You know what I mean? Like, we found ourselves going back to those ETLs over and over again when business requirements would change. Like, you have new dimensions you wanted to add or new filters and so on.
So that, in combination with wanting to be able to deal with more data, we started looking into ClickHouse as a technology, and we understood, wow, there is technology out there that would help us work with a whole different scale of data. And, actually, Carto in the end decided to go in a different direction, and they went more towards the data science aspect of it. So the geospatial data science and not so much building real time data products. The now founders of Tinybird each left to go to different companies at different times, but found the same problems in those different companies, like, you know, whether it was in the financial sector or the retail sector. And whenever there were huge amounts of data that needed to be joined with different dimensions, and then applications needed to be built on top of that, that was always a huge ordeal that involved cathedrals of infrastructure, different components in cloud providers, in order to build things that, from our developers' hearts, let's say, we thought, you know, we don't want to do all of this stuff. We just wanna focus as much as possible on solving the business problems, and we're gonna go fast. And if we can work with any amount of data the way we normally work with small amounts of data, we want to do the same with large amounts of data. So that's how we started thinking about this. And as we started working on this, Javi Santana, who was the CTO at Carto and the 1 that originally started thinking about this, you know, started building a prototype, and then we sort of incorporated the company and started working with some customers. We realized the incredible potential of real time, as in solving problems and making decisions in real time, is something that we have a huge belief is gonna be the norm in the future.
[00:07:22] Unknown:
And as you mentioned, the biggest barrier to actually being able to realize that real time decision making is the number of different processes that you have to do to be able to actually turn the data into something that's a usable format, beyond just how it's maybe stored or ingested. And in terms of the actual capabilities that are unlocked by being able to accelerate the time to value from having a piece of data, loading it into a system, and then being able to join it with other systems, what are some of the primary use cases that you and your customers are seeing from being able to actually have this capability?
[00:07:57] Unknown:
We see various use cases, but we see 2 over and over. 1 we call in-product analytics. Essentially, some of our customers are product companies or services companies that, you know, have a product or a service and serve a large number of users or customers. And they want to provide analytics back to them, for them to understand, you know, how they're taking advantage of that product and what is the benefit of using that product. And normally, that analytical part is not their core business, but it's an incredible value add to bring and something that customers demand more and more. So a lot of customers come to us and say, well, this is great because it enables me to just put my data in here, build the APIs, and I just have to worry about integrating it in my application. I don't have to worry about scale. I don't have to worry about any of that. So that's been something that we've done many, many times already. And the other 1 we see a lot, especially in larger organizations, like in corporates, is what we call operational analytics or operational intelligence, which is essentially real time business intelligence across your organization. For instance, we work with a large ecommerce retailer, 1 of the largest in the world. And these guys, they have, like, an application, not just a dashboard, like, a full blown internal application, a product that over 600 people within the company have open in their browser all day, every day, where they can understand how much they're selling across the world, what are their top products being sold, where they're running out of products in warehouses.
And they can understand that now in real time, thanks to Tinybird. And once they understood they could have that in real time, because they used to have it, but not in real time, that's triggered a lot of other operational derivative products, let's say, like being able to automate some decisions based on that data that's coming in and being able to expose some of those decisions to the final decision maker so that they can predict what's going to be the effect of changing 1 thing, like, you know, choosing a warehouse to serve some region versus the other 1 that's maybe running out of certain products, things like that. So those 2 use cases have been something that we encounter quite a bit. And little by little, we discover some others.
[00:10:24] Unknown:
As far as actually being able to process the data that is being fed to you, what are the main sources that people are sending to you for being able to build these analytical APIs? And what are some of the complexities that you're seeing in terms of being able to actually integrate with these sources and be able to pull from them at the frequency that your customers are looking for? In terms of how we started to ingest data, we started with CSVs because
[00:10:55] Unknown:
CSVs are the international currency for any database system in the world. So when we started, we thought, well, this makes sense because everybody uses CSV in 1 way or another, and we could always fall back to CSVs if we needed to in order to integrate with our customers. There's a lot of intelligence in our product about CSVs and about guessing the right types and about, you know, ensuring that we, for instance, parallelize imports when ingesting through a URL, and things like that. We've made it such that it's very easy to get up and running if you're using CSVs. But then CSVs have their own problems. Like, you know, data quality is still a huge problem with any customer.
Normally, we find that whenever, supposedly, there's gonna be integers in a column, you find all kinds of crap there, or, you know, line endings sometimes are different, so, you know, Windows line endings versus others, things like that that still trip us up sometimes, and we build fixes around that. We wanna make it such that it never fails, but our customers keep surprising us with new things in their CSVs. And then we are now moving towards pretty much any system that enables us to ingest data and that lends itself to real time. Like, Kafka is now a huge focus for us because a lot of the companies we talk to are using Kafka for capturing all types of data, a lot of, you know, web events and similar, and transactions and things like that. It's super simple for us to link to that and then start building APIs on top. That's a huge focus for us right now. But also, you know, we connect to other systems sort of opportunistically, like Snowflake or BigQuery or other data warehouses where people already have dimensional data, and they wanna bring it into Tinybird to do joins and then expose that as APIs.
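To make the ingestion workflow concrete, here is a minimal Python sketch of a URL-based CSV import into a data source. The host, endpoint path, parameter names, and token format are assumptions for illustration only; the exact contract comes from Tinybird's own API documentation.

```python
import requests

# Hypothetical sketch of importing a remote CSV into a Tinybird data source.
# The endpoint path, parameter names, and token format below are assumed for
# illustration; consult the actual Tinybird API docs for the real contract.
API_URL = "https://api.tinybird.co/v0/datasources"
TOKEN = "p.XXXX"  # placeholder import token

resp = requests.post(
    API_URL,
    params={
        "name": "events",                        # target data source name
        "mode": "append",                        # append new rows (assumed value)
        "url": "https://example.com/events.csv", # remote CSV to ingest
    },
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # details of the import job
```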
[00:12:50] Unknown:
As you mentioned, there are a number of systems that already exist for people to be able to actually track and report on data, and it's largely for internal purposes. But as they're starting to maybe try and build out something in house to expose that information for analytical APIs that are going to be consumed by other internal systems or other end users, what are some of the areas of complexity that are often overlooked or misunderstood as they start to go down that road that might ultimately lead them to want to use Tinybird rather than having to build it all in house?
[00:13:21] Unknown:
I think 1 of the things that have been key to some of our customers is the flexibility. When you have sort of complex data pipelines, as I touched on at the beginning, you often have to go back to those data pipelines, and those data pipelines will become almost like a product of their own that you have to maintain over time and evolve and so on. And depending on how complex and how good you are at it, and your team and so on, it might become an obstacle to solving more and more problems over time. So, you know, it's easy to understate how important it is to be able to move quickly and be able to attack new use cases and so on. With Tinybird, 1 of the things that is really appealing to some of our customers, in comparison to what they were doing before, is that flexibility. And the fact that once you have the raw data coming into Tinybird, any new use case is just 1 SQL query away. Just prepare your SQL query, expose it as an API endpoint, and you can start making queries. So that is the sort of thing that, when you start doing it on your own, if you're not using the right tools, you can be surprised at how inflexible your system might become over time. That's 1 thing. And then the other thing as well that we pay a lot of attention to is that these data products, as they grow, require some of the things that you also require in web applications or any other type of development, like tests and continuous integration.
And, you know, you want to work with a large team of people, and you're gonna want to be able to connect all of your configurations and queries and so on to your repository so that you can see what the changes have been over time and implement those tests and so on. And all of that is something that when you do a data product for the first time, especially if it's a large 1 and so on, you're not used to thinking like that in data products, but you soon start missing if you know what you're doing. You soon start feeling, oh, I need to add tests here. I need to have some sense that I'm not gonna break anything every time that I put it in production. And that's something that through the tools that we're building around Tinybird, we want also to help our customers and users with.
[00:15:40] Unknown:
Another area of complexity might be in terms of the format that that API takes. Do you focus primarily on just enabling REST APIs, or do you also look to provide GraphQL or, you know, maybe an RPC interface for being able to interact with this information? And, also, as far as monitoring and managing uptime and scalability, what are some of the additional systems that you've had to build out to be able to actually maintain the system, and that, you know, if somebody were to do it in house, they would end up having to build out, and that would distract them from the primary purpose that they're trying to deliver?
[00:16:14] Unknown:
The first part of it, in terms of the APIs, we right now return JSON or CSV. But, I mean, we have a set of APIs for you to use Tinybird as the user of Tinybird, let's say, to create endpoints, create data sources, replace data, all of those things. And then there are the APIs that you generate with Tinybird, and those are all for analytical purposes. So read only, let's say. And those are straightforward JSON and CSV that you can shape using SQL in the Tinybird pipes. You have pipes that work like a notebook kind of interface. You can chain queries and then decide, you know, what is the result you want to expose as an API. And that can be, right now, either JSON or CSV. In the future, we plan to add other things, like, potentially, as you say, an RPC interface, for instance, or WebSockets, things like that, so that you can do all kinds of things. Right now, it's just straightforward JSON and CSV APIs.
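As a concrete illustration of consuming one of these generated endpoints, here is a small Python sketch. The pipe name, URL shape, query parameters, and the `data` key in the JSON response are assumptions for this example; the real values come from your own workspace.

```python
import requests

# Sketch of calling a Tinybird-generated analytical endpoint that returns JSON.
# The pipe name, URL format, and parameters are hypothetical placeholders.
ENDPOINT = "https://api.tinybird.co/v0/pipes/top_products.json"
TOKEN = "p.XXXX"  # placeholder read token for this endpoint

resp = requests.get(
    ENDPOINT,
    params={"token": TOKEN, "date_from": "2021-01-01", "company_id": 42},
)
resp.raise_for_status()
for row in resp.json()["data"]:  # "data" key assumed from the JSON output format
    print(row)
```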
And then the other question about monitoring and observability, that's a huge part of our product. And, actually, interestingly, we weren't necessarily thinking about that when we started. But because we felt that pain constantly, like, with our customers, you know, what's going on with this customer? Why is this going slow? Why are these requests failing? Why is this happening? We added a lot of observability, like, a whole layer of observability on top of Tinybird that we can use internally, but we expose it to our customers as well. So our customers can know exactly, in real time, who's using their API endpoints, how many requests are coming, you know, how many ingestions, how many rows are you ingesting, at what rate, what speed, what duration. All of that information is available for you to query with SQL as part of Tinybird, just like you would query your own data. So you can build your own monitoring, and that's been hugely helpful for our customers. And it continues to be, like, my favorite feature as a person that works at Tinybird and needs to make decisions.
Because for the first time in my life, like, every question I have about what's going on with the platform, I can answer it with my own product, and that's a huge boost for us. Like, if we're thinking about changing our pricing and we want to understand, you know, how many requests or what the amount of data is, all of those things, we can just answer those questions using Tinybird, which has been a huge boost for us in many ways.
[00:18:38] Unknown:
Yeah. It's definitely validating when you wanna use the thing that you're building and not just sell it to other people.
[00:18:43] Unknown:
Exactly. And, I mean, this is nothing new, I'm not discovering anything new, but with that kind of approach, it's very frustrating when things are not working as they used to, and that drives a lot of decisions. Like, even if the customer's not telling you, you know, there are some things that are just plain wrong there and they need to be changed. And that's been a huge source of feedback for us.
[00:19:03] Unknown:
Digging deeper into the platform itself, can you talk about how it's architected and some of the ways that it has changed or evolved in terms of the goals and implementations since you first began working on it? We try to keep it as simple as possible. And
[00:19:17] Unknown:
for that, we try to keep the minimum set of dependencies we can. And at a high level, for instance, we try not to have any lock in with cloud providers, but we use Google at the moment. We just use, essentially, compute, so we can run Tinybird pretty much in AWS and in Azure as well. But we don't wanna lock ourselves to any cloud provider right now because we also see a huge future for ClickHouse, and we'll talk about ClickHouse later, I'm sure. It takes advantage of every CPU core you throw at it, and the closer you are to the metal, the faster you can make it go. It's something we haven't invested a lot of time in, but we see a lot of future in using bare metal directly, you know, for huge use cases. But in general, that's something we keep in mind. We try to keep as few dependencies as possible. And I was saying, at a high level, that means not locking in with cloud providers, but at a lower level, with libraries as well. Like, before we decide to use a specific library in our code, we make sure that we understand it really, really well, such that we can live with it and change it and do whatever we need to do if need be. So we don't want black boxes in that sense. And the same goes for ClickHouse itself. Like, we spend a lot of time, and we're hiring ClickHouse experts, to make sure that, you know, we understand what's going on under the hood, and we can alter it and make improvements as we go.
And then in general, in terms of components, you know, we have some load balancing just in front of our app servers, which are written in Python with Tornado, and we have some background processes as well, apart from the app servers. And then we use Redis for metadata storage and then ClickHouse for all of the analytical queries and so on. So it's not a hugely complex setup, and that's largely how it looks.
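To picture how those pieces fit together, here is a deliberately simplified sketch of that kind of stack: a Tornado handler that looks up per-endpoint metadata in Redis and runs the SQL against ClickHouse over its HTTP interface. All names, ports, and the metadata layout are illustrative assumptions, not Tinybird's actual implementation.

```python
# Highly simplified sketch of the stack described above: a Tornado app server
# that reads per-endpoint metadata from Redis and runs the analytical query
# against ClickHouse over its HTTP interface. Names and schemas are assumed.
import json
import redis
import requests
import tornado.ioloop
import tornado.web

CLICKHOUSE_URL = "http://localhost:8123"   # assumed local ClickHouse HTTP port
meta = redis.Redis(host="localhost", port=6379)


class PipeHandler(tornado.web.RequestHandler):
    def get(self, pipe_name: str):
        # Endpoint metadata (the SQL behind the API) lives in Redis.
        raw = meta.get(f"pipe:{pipe_name}")
        if raw is None:
            raise tornado.web.HTTPError(404)
        sql = json.loads(raw)["sql"]
        # Run the query on ClickHouse and return the JSON result.
        ch = requests.get(CLICKHOUSE_URL, params={"query": sql + " FORMAT JSON"})
        self.set_header("Content-Type", "application/json")
        self.write(ch.content)


app = tornado.web.Application([(r"/v0/pipes/([a-z_]+)\.json", PipeHandler)])

if __name__ == "__main__":
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
```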
[00:21:16] Unknown:
And I know that a number of people are actually using ClickHouse for some of the metrics type data in their systems as well for being able to collect logs and manage the infrastructure time series data. Are you able to actually use ClickHouse for your monitoring as well as the product?
[00:21:31] Unknown:
Yes. Basically, we collect, you know, all of the stats. For instance, we're working on a public web page that's gonna show all of our traffic in real time through Tinybird API endpoints. So you can see, you know, as traffic is coming in, you're gonna be able to see it at scale, because we drop all of that back into our ClickHouse server that has Tinybird on top, and we can quickly analyze it and so on. So, yeah, we use it for everything.
[00:22:00] Unknown:
And you mentioned that it's sort of the core building block, and that it's something that you found when you were working at Carto. But as you were revisiting this idea of I wanna be able to deliver real time APIs as a service to other people, what were your overall criteria in terms of the decision making? Did you think about anything else besides ClickHouse, or was it just ClickHouse all the time and you you knew that going in? And just curious sort of what the decision process was as you were deciding to stake your business on this piece of technology.
[00:22:31] Unknown:
We fell in love with ClickHouse first, and then we started looking at alternatives to see, is this the best out there? But there's a lot of good reasons for us to use ClickHouse. First 1, if we're thinking about, you know, enabling customers to scale to whatever they need to scale, ClickHouse is the absolute best out there. It's super scalable, both horizontally and vertically. Another huge aspect of it is that it uses SQL, and SQL is, like, you know, the lingua franca, let's say, for database systems, and developers all around the world know SQL, so they don't have to learn anything new to use Tinybird. Then, you know, it's really, really good. The queries have super low latency, which, if you're trying to build a real time product, you need. And that's the problem with trying to do that with systems like BigQuery or Snowflake or Redshift, which are great at running queries over huge amounts of data, but you always have, you know, some latency that makes it really hard to scale, let's say. If you're gonna hit that system with hundreds of queries per second, first, you're gonna have to throw a lot of money to solve that problem so that you can scale that up in terms of servers and so on. And, you know, solving real time with endless money is possible, you know, but the key is to solve it at a budget. And that makes sense. So ClickHouse allows you to have super low latency and also a great ingestion rate. Like, it can ingest incredibly fast, and that's what we consider real time. You need both. You need to be able to ingest the data really fast, and you need to be able to query really fast. Otherwise, you know, if you're adding a lot of delay on 1 side or the other, you're basically moving away from real time, and that was a key factor for us. And then, a thing I mentioned a bit earlier, it has state of the art algorithms in the sense that it's built to take advantage of every single CPU core you throw at it, and that's why it is so able to scale. And, also, super important from the business point of view, it's a super well maintained open source project. And it has a thriving community that's adding more and more people to it every day. And for us, that was also a key factor.
[00:24:53] Unknown:
And as you have been using ClickHouse in earnest for a while now, and you've been testing it at various levels of scale. What are some of the sharp edges that you've run into while running it and some of the other custom tools or patches that you've made to it or systems that you've built around it to be able to help deal with some of those challenges?
[00:25:13] Unknown:
In a general sense, something to keep in mind is that there's a relatively large code base. ClickHouse is not huge, but it's large enough that, you know, there's a lot of corners and areas of code to explore and understand. And it's really well written and it's easy to follow, but sometimes, for certain things, you really need to understand it in detail to know what's going on and to be able to troubleshoot something that might be happening. We always say that ClickHouse is like a Formula 1. You know, if you know what you're doing and you have a team of people that understand it really well, like, you know, in the case of Formula 1, it would be the mechanics and the engineers and so on, and you have been driving for a long time, you know, you can make it run at 300 kilometers per hour. But what we're trying to do is make it so that anyone can run it at that speed. Any driver, let's say, any developer. So that's something to keep in mind. And there's been things that, you know, we've learned in the process. Like, you know, there's some data management oddities, I would say, that, if you think about it, make all the sense in the world, but you don't expect it to work like that when you use ClickHouse for the first time. For instance, any deletion you want to do, you need a lot of disk space to do that deletion. Because in order to delete data, let's say you have a partition and you just want to delete a month of data and you're partitioning by year or something like that, what ClickHouse will do is it will copy the partition without the bit that you want to delete.
And then once that's finished, it will do the swap and get rid of the old partition, more or less. So we found some customers wanted to do some massive deletion and they couldn't. Or, for instance, they had a TTL, like, time to live, in their data source table, let's say, such that certain data needed to be deleted after a day or something like that, like 24 hours. But they were ingesting such huge amounts of data that it was, you know, terabytes of data. And when it was going to get deleted, we couldn't, because there wasn't enough disk. We hadn't realized that it was gonna be so big and the TTL would not be able to work. And that caused us to scramble, extend the disk, you know, those kinds of things. So those are things that we didn't expect. Even if you understood the problem well before designing the system, it's very easy to overlook, you know, very easy to have a customer that you're basically allowing to ingest any amount of data and realize that that might be a problem. And then things like, you know, for instance, we found a bug also when you're doing replication. ClickHouse is really good at replicating data and keeping all replicas up to date. But whenever you do a command that requires a synchronous ACK from all of the replicas, like, just so that you're sure that all of the replicas are up to date with a certain command, like an optimize or something like that, there was a bug where it would wait for all the replicas even if a replica had died. You know, if that replica had crashed, it would wait for all of the replicas to confirm that the command was finished and basically wait forever until someone would realize and restart that replica exactly in the same way. So things like that we found along the way, and when we found them, we provided a patch or something. And, I mean, the ClickHouse core team is amazing at, you know, accepting those patches and putting them into the master branch and so on. And, yeah, there's other problems. Like, you know, the loading of the tables. You need to bear in mind that it's alphabetical, and so if there are any dependencies between tables, like a Join table or something like that, and you try to load that table before the other 1, then it will crash, you know. So we had to build something to ensure that the tables are loaded in the right order and, you know, that we don't have those kinds of problems. So we found a lot of things over time that, you know, you don't think about when you're building a system that needs to serve potentially thousands of customers, and that's something that we've learned a lot about.
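As an illustration of the deletion trade-off described above, here is a hedged sketch using the clickhouse-driver package (any ClickHouse client would do). Dropping a whole partition is cheap, while an ALTER ... DELETE mutation rewrites the affected parts and therefore needs spare disk on the order of the data being rewritten, so it is worth checking headroom first. The table, partition key, and filter are hypothetical.

```python
# Sketch of the two deletion paths in ClickHouse described above, using the
# clickhouse-driver client. The "events" table and its partitioning scheme
# are hypothetical placeholders for illustration.
from clickhouse_driver import Client

ch = Client(host="localhost")

# Cheap path: the data to remove lines up with a partition boundary.
ch.execute("ALTER TABLE events DROP PARTITION '202101'")

# Expensive path: a mutation that rewrites parts. Check disk headroom first.
free_bytes = ch.execute(
    "SELECT free_space FROM system.disks WHERE name = 'default'"
)[0][0]
table_bytes = ch.execute(
    "SELECT sum(bytes_on_disk) FROM system.parts "
    "WHERE table = 'events' AND active"
)[0][0] or 0

if free_bytes > table_bytes:  # crude safety margin; tune for real workloads
    ch.execute("ALTER TABLE events DELETE WHERE user_id = 42")
else:
    raise RuntimeError("Not enough free disk to run the delete mutation safely")
```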
[00:29:22] Unknown:
In terms of the actual life cycle management, you were mentioning that for deletions, you have to have roughly double the disk space. And I'm wondering if you can talk a bit more about, in terms of your platform and the user experience, do you allow people to just maintain data in perpetuity? Do you have sort of an enforced life cycle policy? What sort of tuning is available for people to be able to determine at what cadence data should expire, and what happens to it after it's expired?
[00:29:52] Unknown:
So all the data that you have in your Tinybird account, we consider it hot data, as in subject to be queried and needing to be available in a low latency fashion, let's say, at any time. So we don't have a concept of sort of cold data or hot data. Everything is hot data and subject to be queried at any time. And in that sense, right now, our pricing revolves around storage and concurrency. At least right now, you need to think a little bit about what you wanna do. And for large use cases, we help our customers design their system and recommend, okay, this is what you're gonna need and so on. But right now, customers have full control of basically what TTL they want to establish. You know, we help them work with that. And then we are very alert about potential problems, especially because we've learned sometimes, even if it looks like they have plenty of space, you know, suddenly a few customers can be doing different things at the same time and then cause a problem. So we keep a very close eye with alerts and our own observability layer to make sure that, if some customer needs to be warned, hey, you need to be careful here because you're gonna run out of space or you need to upgrade or you need to change your policies, then we do it on a sort of a reactive basis, let's say. We are moving towards making it completely customer driven, as in, I want more speed.
That sounds very movie like, but because we know how the query is built, and we know how many CPU cores that particular query is using to return. And we can tell you, hey. If you want this query to go faster, we can do that. You know, click here to upgrade or click here to get more speed. And we can tell you in advance how fast that query potentially could go. So those are things we're very interested to explore. And the same for disk space, you know, that you can see when you might be running into problems and that you can extend it yourself without any help from us.
[00:31:58] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
And you mentioned that you have the query analyzer to be able to understand. This is the potential speed that you could get trying to answer this question. And as people are designing their APIs and the queries that they're trying to, you know, drive the API from, what are some of the performance issues that they might run into or some of the challenges that they might have in terms of being able to actually formulate the query in a way that makes sense for them being able to deliver in an API?
[00:33:00] Unknown:
We try to be fast by default, as in we want the experience to be really, really good from the beginning, such that when you upload data or you connect it to a Kafka topic or something like that, you can start working with the data and everything is fast and so on. But, obviously, we're not putting limits in place, at least not now. We'll see in the future if there are some things we need to block. But, you know, anyone can write a slow query if they set their mind to it. You know? So that's something that, you know, you can't help. So there's always gonna be cases where someone comes in and writes a slow query, especially when you have huge amounts of data. But we try to make a lot of decisions in advance for the customer, for the user, that you can always then change. Like, for instance, how the data is ordered on disk is really important for performance.
So based on the data that we see coming in, we make those decisions upfront so that, you know, we try to have smart defaults for our customers, for users. And then the same goes for a number of things that, if you've never worked with analytical databases, you probably don't know or you're not used to thinking about in that way, with columnar databases in particular. Like, for instance, when you're working with massive amounts of data, you wanna make sure that you think about how you join the data intelligently. And the first typical thing is, hey, you're gonna join? Well, filter first and then join, such that you reduce the amount of data that you then need to join. So those kinds of things, we are starting to, little by little, add functionality in the product so that when we detect that type of pattern, we're gonna say, hey, you should change your query like this, it'll go faster, you know. And those kinds of things, because we see a lot of different use cases across our customers and the types of queries that they can do, you know, we can build functionality to help our customers, you know, write better SQL, if you know what I mean. And even at times just not say anything and change the query on the fly before it comes back. So if you're used to writing Postgres SQL or MySQL SQL, you know, it'll work and it'll be fast even if it's not ideal. We can't always do that, but there are certain cases where it's just a question of understanding how the query is structured and doing the change in the background if need be. Those are some of the things that are worth learning. And, by the way, we did a real time analytics course when we were getting started, to get leads and so on. And a lot of people signed up, and it was all about those kinds of things: what types of things you need to bear in mind when you're writing queries over huge amounts of data.
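Here is a small sketch of the "filter first, then join" pattern mentioned above, against a hypothetical events/companies schema. Both queries return the same result, but the second reduces and aggregates the large table before the join, which is the kind of rewrite a columnar store rewards.

```python
# Sketch of the "filter first, then join" rewrite on a hypothetical schema.
from clickhouse_driver import Client

ch = Client(host="localhost")

slow_sql = """
    SELECT c.name, count() AS hits
    FROM events AS e
    JOIN companies AS c ON c.id = e.company_id
    WHERE e.date >= '2021-01-01'
    GROUP BY c.name
"""

fast_sql = """
    SELECT c.name, hits
    FROM (
        SELECT company_id, count() AS hits
        FROM events
        WHERE date >= '2021-01-01'     -- filter and aggregate first
        GROUP BY company_id
    ) AS e
    JOIN companies AS c ON c.id = e.company_id
"""

print(ch.execute(fast_sql))
```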
[00:35:42] Unknown:
In terms of the actual end user experience, there are a couple of things that are probably worth digging into. 1 being the actual data loading because as you were mentioning, the way that you structure your queries is influenced by how you load your data and the structure there. And then the other aspect of that is schema evolution where the data source changes, and then you need to be able to reflect either a new column or a changed data type in the ClickHouse cluster. And so I'm wondering if you can just talk through the overall workflow of somebody who's getting started with Tinybird and building an API and how those data loading considerations factor in. You've touched on something that we always have a feeling that if we get right, we're gonna take on the world because
[00:36:24] Unknown:
it's something that's really challenging to do when you have a lot of data. This is going back to the Formula 1 thing, you know. We want that to be as easy as possible without really necessarily understanding what's going on. And things like adding columns, you know, just adding a column is not that problematic, but, you know, changing the schema or changing the order of the data or doing those kinds of things, some will require you to recreate those tables and so on, and that is a pain. So with Tinybird, we have not just the UI where you can use the browser and write your queries and so on. We also have a command line interface. You can think of it as a type of git-like command line interface that allows you to pull all your schemas and all your queries into text formats that you can work on in your IDE or Visual Studio or your text editor or whatever, and then upload that to GitHub and work collaboratively and then push back into Tinybird. And we started with the CLI to build versioning, both for data sources and pipes or API endpoints, such that if you have a data source that, you know, maybe already has huge amounts of data, and you already have an ingestion coming in, or several different points of ingestion for the same data source, and you want to add columns or you wanna change something or the order or whatever, you can create a new version of that data source.
And you can do it in such a way that all the data from the initial data source will pass on to the new version. And whenever you're ready, you can just point the pipes to the new version, and then your APIs will not notice anything. And you can continue either ingesting in the old data source, and the data will sort of flow through to the new 1, or start ingesting to the new 1. So that's how our versioning system works right now. But we've realized that sometimes, especially when you're starting and you still don't know and you're still playing with the data and so on, there's a number of things you just want to do, and you want that to be as simple as possible. So we're adding things like adding a column straight away, without having to version or anything like that. And in a way that won't break your existing ingestion, because that's something you have to bear in mind. If you're ingesting a CSV that has 6 columns and suddenly you add 2 new columns or you remove a column, you know, what happens with the ingestion that's coming in? So we're doing it in such a way that it won't upset any ingestion process or anything like that, and you can evolve that much quicker. And then if you want to have versions, you can also do that. So that's how it works right now. It's 1 of the things that nobody realizes at the beginning.
And then when you start having a big project and so on, it becomes a thing that you need to master.
[00:39:16] Unknown:
Another interesting aspect of the platform you're building is how you manage multi tenancy where, for instance, you mentioned a customer who had terabytes of data that they wanted to drop a month's worth of. And so now, all of a sudden, you're running into disk space issues. And, obviously, you don't want that to impact other people who maybe have a smaller volume of data, where they're working with gigabytes per month, you know, and they just wanna have a simple API. And so I'm curious how you're managing that multi tenancy. Do you have dedicated clusters for each customer? Do you have some customers who are on a larger plan who have a dedicated cluster, and then others are on a shared cluster where you have quotas established for usage of the ClickHouse cluster? And just how do you sort of implement all of that and build it in a way that you're actually able to make it maintainable without tearing your hair out?
[00:40:01] Unknown:
That's a good question. So we have 2 kinds of accounts. We have sort of shared resources kinds of accounts, which are purely multi tenant, and we also have dedicated resources. So in the shared approach, basically, that means everything is shared. Like, you share load balancing, you share web applications, you share, you know, the infrastructure is the same. And then in terms of ClickHouse, the shared infrastructure is a huge cluster with multiple databases and 1 database per customer, let's say. Each cluster can have multiple instances, and those instances are within the same network. So, you know, we can scale that up as needed. And then, basically, that's how the shared approach works. And in the case of ClickHouse, it's slightly different from other databases. Like, a cluster can have multiple ClickHouse instances, and then each instance can have multiple databases.
And then those are just essentially collections of tables with their own users and security and so on. So that's how we do sort of the shared approach. And then the dedicated approach can be 1 of 2. 1 is fully dedicated, and that's your own Tinybird, basically. It's just everything. Nothing shared at all. We do that for large customers that basically want nothing shared whatsoever, and that's fine with us. And we also, in those cases, explore with our customers whether it's our cloud or their cloud. Sometimes they prefer to pick up the bill, let's say, even if we operate it for them. That's something we're very open to. And then the other dedicated approach is a mixed 1, which is you have your own dedicated database. That's where you're going to see the performance improvements and where you don't want any fighting amongst resources.
And then they share some of the common infrastructure, like, you know, load balancers and maybe some web applications and so on. But, really, when we talk about dedicated resources and high availability and so on, the key of that is in the database, in the ClickHouse instances, and we can very easily set that up for any customer.
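A rough sketch of the shared-cluster layout described above: one ClickHouse database per customer, each with its own user and grants. The naming scheme and password handling are placeholders for illustration, not Tinybird's actual provisioning code.

```python
# Illustrative per-tenant provisioning on a shared ClickHouse cluster:
# one database per customer, a dedicated user, and grants scoped to it.
from clickhouse_driver import Client

admin = Client(host="localhost", user="default")

def provision_tenant(tenant: str, password: str) -> None:
    db = f"t_{tenant}"
    admin.execute(f"CREATE DATABASE IF NOT EXISTS {db}")
    admin.execute(
        f"CREATE USER IF NOT EXISTS {tenant} IDENTIFIED BY '{password}'"
    )
    # The tenant only ever sees its own database.
    admin.execute(f"GRANT SELECT, INSERT ON {db}.* TO {tenant}")

provision_tenant("acme", "change-me")
```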
[00:42:13] Unknown:
And as you have been building out Tinybird and working with your customers and using it for being able to, you know, build APIs for monitoring Tinybird to build Tinybird, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:42:27] Unknown:
I mean, recently a customer worked on a use case that we're in love with. A customer of ours is a platform as a service type of business, and they have a huge CDN, edge servers, let's say, around the world. And they've built with Tinybird their own WAF, like their own web application firewall. Essentially, how that works is that all of the logs from all of those edge servers are being sent to a Kafka instance, from which we ingest. And, you know, from the moment that a request comes into 1 of those edge servers to when it's available in Tinybird is maybe 2, 3 seconds. It's really, really fast, if you consider that's sort of an average across the world. And then they built some API endpoints such that each edge server, every 5 or 10 seconds, queries the incoming data. So it sends a request to Tinybird to see if there are any specific IPs that are generating a huge peak of traffic.
And if that's the case, they will cut the traffic from that IP to avoid a denial of service attack kind of thing. So that's been something we were not thinking about at all, and the customer of ours first built a different use case and then thought, actually, would this scale to do this? And they managed to do it super well, and they managed to do it really, really quickly. And it forced us to be creative about certain things we hadn't thought about in terms of how we'd ingest the data faster and so on. Because think about it, when there's a denial of service attack, you know, we're ingesting maybe, you know, 40,000 records per second on average. But when there's a denial of service attack, maybe that goes up to 3 times that or more. And even if it's just 1 server, it'll be hitting us with a lot of requests, and we need to make sure we can keep up, because otherwise it will defeat the purpose of what they're trying to do. So it's forced us to improve some areas of the product, like ingestion from Kafka, to give it even higher bandwidth than we had at the beginning.
But we absolutely love that use case.
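To make that firewall loop concrete, here is a hypothetical sketch of what each edge server might run every few seconds. The endpoint name, parameters, and response fields are assumptions; the real pipe would be a SQL query over the ingested access logs, grouped by IP over a short window with a threshold.

```python
import requests

# Hypothetical polling loop for the WAF use case described above. The
# endpoint name, URL shape, parameters, and response fields are placeholders.
ENDPOINT = "https://api.tinybird.co/v0/pipes/suspicious_ips.json"
TOKEN = "p.XXXX"  # placeholder read token

def fetch_ips_to_block(window_seconds: int = 10, threshold: int = 5000) -> set:
    resp = requests.get(
        ENDPOINT,
        params={"token": TOKEN, "window": window_seconds, "threshold": threshold},
    )
    resp.raise_for_status()
    return {row["ip"] for row in resp.json()["data"]}

# Every 5-10 seconds each edge server refreshes its local blocklist.
blocked = fetch_ips_to_block()
print(f"blocking {len(blocked)} IPs")
```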
[00:44:33] Unknown:
In terms of your own experience of building out Tinybird and growing the business and helping your customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:44:43] Unknown:
From a business point of view and in general, something we believe very strongly in and we've added to our company principles is something we say, which is speed wins, because speed sort of applies in every aspect. Your speed is the definitive differentiator, however you look at it. You know, at the business level, you make decisions faster, then things move faster, you can do more things, you can resolve more problems. And a lot of the decisions, maybe 90% of the decisions, are reversible. So many times it's much better to just make a decision than to hang around and think and ask and bother a lot of people, you know, because a lot of the times, those decisions just don't matter. Just go ahead and do it quickly and then let's see the result rather than, you know, getting paralyzed by the analysis. And then the same goes for technology and infrastructure. If your product is faster, you'll need less infrastructure to serve the same use cases, which means less cost and a better business.
But the same goes for the user experience. The faster the results, the happier the customers will be, you know, the more they'll talk about Tinybird. And so it's something that we already thought, but we've seen it in such a clear way that we've made it a company principle. And, you know, you can see it in our chats often. Someone will ask, hey, what do you think? And the answer is oftentimes speed wins, you know, which means you decide. It's not that it doesn't really matter, you know, it's probably better to just quickly decide and move forward than to hang around with this and so on. That's been 1. And then another 1, from a business point of view as well, is, you know, we have grown convinced about real time being something that will be the norm in the future, and that's a huge thing for us, because I think a lot of companies live with batch processing and its delays as a necessary evil, like, that's the way it is. It's like background noise, something that you don't necessarily notice until someone switches that background noise off. And when you realize some of the things you can do in real time, you start thinking, wow, what else can I do in real time? You know? And that opens up a lot of new ways of thinking about your business and opportunities about how to do things, and that's been a huge discovery for us. Yeah. Those are a couple of them. On a more pragmatic basis, something we've learned is that data is always dirty.
So whenever you think that, you know, yeah, just start ingesting and blah blah blah, and, you know, it's very easy, there's always problems with the data. You know, we assume it's never as easy as it looks with data, especially when you have huge amounts of it. But, yeah, those are some of the lessons that we've learned over time. And for people who want to be able to deliver analytical APIs
[00:47:40] Unknown:
and do it at high speed, what are some of the cases where Tinybird is the wrong choice and they might be better off either building something internally or using some other off the shelf product or system?
[00:47:50] Unknown:
Obviously, this is purely analytical, so anything resembling transactional use cases, this is not the right product for. You know, either just using Postgres or MySQL or, you know, any transactional database, or new services that are coming out now that are databases as a service but geared towards transactional use cases, would be a better choice. And then for use cases like point queries, you know, it's not that we would necessarily be the wrong choice, but we are not particularly better than other systems. You know? I mean, we do have some of those and it's really fast, but not necessarily faster than any other system. So that, and key values in general, it's not ideal. Anything that's time series and so on is great, and where you can throw huge amounts of data and so on. And we talked about in-product analytics at the beginning. Another reason why Tinybird is great for that is that you only do queries by company ID, let's say, or customer ID, which enables you to limit the amount of data that you're querying just by default. So even if you have huge amounts of data in total, you're always gonna be querying for a particular company ID. You know, those are the kinds of things that really make sense. But, yeah, transactional use cases or point queries, you know, those are not the ideal use cases for Tinybird.
[00:49:16] Unknown:
And as you continue to build out the product and the business, what are some of the things that you have planned for the near to medium term? 1 of the things that happened with Tinybird is that our first customer was a huge customer,
[00:49:28] Unknown:
and we were, like, we're just 5 or 6 people, and we were thrown into needing to deliver that use case. And that forced us to focus on scalability and reliability and performance, but not necessarily on the user experience, because we were basically running the show and sort of building out the product and the solution, let's say. But the future for us is in enabling developers. So we want to make this so easy to use for any developer or data engineer out there that they just don't think about anything else. So 1 huge sort of change in terms of focus has been that over the last few months. So we're investing a lot in making this super easy to use, to connect to any data source, to start ingesting, to start building queries, to build APIs.
So that's something that we're gonna do more and more. And that also means adding the toolset that developers are comfortable using. That's why we've added a command line interface. That's why, you know, we make it really easy to integrate with GitHub, all of those things. Apart from that, we are super focused on high frequency ingestion, high concurrency types of use cases. That's where we see the market going more and more, and that's where we're really good and where we can scale to handle pretty wild use cases. And that's where our main focus is gonna be for the foreseeable future. And then in terms of the business, the company was born in Spain, and we are now starting to have customers all over the world, and we're gonna start, you know, expanding into those territories, like the US and the UK and the rest of Europe, over the next few months. So we're gonna start making a lot more noise.
[00:51:16] Unknown:
Well, for anybody who wants to follow along with the work that you're doing and keep in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. You know, 1 of the areas where we see
[00:51:33] Unknown:
a lot of opportunity is in the sort of data ops aspect of building applications and data products. If you think about web development over the last 15 years, a lot of people have learned and take for granted things like continuous integration and, you know, testing, test driven development, things like that. And that's something that we don't see as much with our customers. Like, they don't have that kind of approach when it comes to data products. And I think there's a huge opportunity there and something that hopefully we can help drive from Tinybird, to sort of establish the right way to build data products. You know? What are the best practices and what are the right types of tools to do that? And that for us is something that we miss when we build data products, and that has sort of guided us towards adding certain functionality to the product that we didn't initially expect we would need, from the point of view of, you know, being able to automate certain things, like testing, doing checks automatically when you're pushing new endpoints to the system, enabling, you know, continuous integration in an easy way and integration with a source code repository, all of those things.
I think it's a big gap and something that teams need to build by themselves a lot, and it takes time and learning and so on. And that's something that when we've had to run big data products, it's the first thing we think about because we know we're gonna run into that soon enough. So we start with that, and we wanna sort of to see how we can help developers learn how to do that, not just with the web development products, but with their data products as well.
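As one possible shape for the kind of endpoint testing described here, a minimal pytest-style check that an analytical API keeps the contract an application depends on. The URL, token variable, and expected columns are assumptions about a hypothetical project, not a prescribed Tinybird testing framework.

```python
import os
import requests

# Hypothetical CI test for a data endpoint: assert it stays up and keeps
# returning the columns the application depends on. URL, token, and column
# names are placeholders for illustration.
ENDPOINT = os.environ.get(
    "TOP_PRODUCTS_URL",
    "https://api.tinybird.co/v0/pipes/top_products.json",
)
TOKEN = os.environ["TINYBIRD_READ_TOKEN"]


def test_top_products_contract():
    resp = requests.get(ENDPOINT, params={"token": TOKEN, "limit": 10})
    assert resp.status_code == 200
    rows = resp.json()["data"]
    assert len(rows) <= 10
    # The contract the application depends on: these columns always exist.
    for row in rows:
        assert {"product_id", "units_sold"} <= row.keys()
```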
[00:53:16] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Tinybird. Certainly a very interesting project and 1 that suits a big need in the overall ecosystem. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thank you very much for having me. It's been great. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Jorge Sancha: Introduction and Background
Overview of Tinybird and Its Goals
Use Cases and Capabilities of Tinybird
Data Sources and Integration Challenges
Platform Architecture and Evolution
Challenges and Customizations with ClickHouse
User Experience and Query Optimization
Managing Multi-Tenancy and Customer Resources
Interesting Use Cases and Customer Stories
Lessons Learned and Company Principles
Future Plans and Focus Areas
Biggest Gaps in Data Management Tooling
Closing Remarks and Contact Information