A Candid Exploration Of Timeseries Data Analysis With InfluxDB

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With our managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster.

With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to dataengineeringpodcast.com/linode today (that's L-I-N-O-D-E) and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

RudderStack is the smart customer data pipeline.

Easily build pipelines connecting your whole customer data stack, then make them smarter by ingesting and activating enriched data from your warehouse,

enabling identity stitching and advanced use cases like lead scoring and in app personalization.

Start building a smarter customer data pipeline today. Sign up for free at dataengineeringpodcast.com/rudder.

Your host is Tobias Macey. And today, I'm interviewing Paul Dix about InfluxData and the different facets of the market for time series databases. So, Paul, can you start by introducing yourself?

Yeah. I'm Paul Dix. I'm the cofounder and current CTO of InfluxData. We created an open source time series database called InfluxDB back in 2013. I started the company in 2012, and then we pivoted into time series databases in 2013.

My background

is obviously I'm a developer, trained in computer science,

you know, other than an interest obviously in databases and stuff like that. I've done work in search and information retrieval,

had an interest in machine learning for a bit.

And you mentioned that you ended up pivoting into the time series space. I'm wondering what your original vision was for the company and what your impetus was for making that pivot. As I mentioned before, I have a background in machine learning. So in

2012,

I started a company with my co founder and the idea was

to create

a SaaS application for doing real time metrics and monitoring and stuff like that. So

in the same vein as like Datadog or New Relic or Stackdriver or server density or whatever, and my plan at that time was that I would use these machine learning techniques and apply them to, you know, this kind of server monitoring data

so that I could do things like anomaly detection,

predictive analytics,

you know, help you with root cause analysis.

So it's this very common thing in with machine learning plans, which is I'll get a bunch of data, I'll do machine learning, and then magical answers will come out the other side.

You know, we built an app. We built a bunch of infrastructure.

I had to build infrastructure for storing and processing time series data at scale,

which was kind of a pain to do. And we had to build all this other stuff.

So after, you know, a year of building things, we actually hadn't gotten to the point at all where I was actually doing machine learning stuff. I was still doing infrastructure.

And

we weren't really picking up customers in a way that I thought would

allow us to, you know, achieve escape velocity. Either hit the next milestone for our next fundraise

or, you know, just have enough money coming in the door where we could just pay ourselves and not have to fundraise at all.

So

what I saw was

when we had to create the infrastructure for this app, the time series solution that I built was essentially

web services written in Scala using Cassandra as the primary store and then using Redis as this, like, indexing layer. And this was just sort of, like I said, like metrics data,

raw events, like application exceptions, and all sorts of other stuff.

And

I saw a couple of things. 1, for the people who were paying us, I asked them how they were using our app,

and

what I saw was that they were basically using our API, like a time series API, and then just using some of the core, like, dashboarding functionality that we had built in in the web UI. And

the infrastructure that we were building was actually more interesting than the original app idea I had

because,

1, that core infrastructure was kind of a pain to build, but also I had done the exact same thing previously

in

2010 at a Fintech startup that I was working at. So for that,

I was working with another guy and we had been brought on to essentially build a time series database to track

basically a few hundred thousand symbols in the market that were getting real time pricing updates.

So basically once every 10 seconds, a pricing engine would give an update, layer that in with actual real market data.

And the solution I built back then for that was also

Scala Web Services with Cassandra

and Redis. And the thing is, like, in Fintech,

then, I think, as well as now, there were basically a few solutions that were common for time series data. There's one called kdb+ and another called OneTick, and we had been using OneTick at the time. Within fintech, their needs were quite a bit different than what server monitoring needs looked like. Server monitoring was hundreds of thousands or millions of unique time series that were ticking at, you know, intervals of, like, once a second, once every 10 seconds, once a minute.

FinTech data was usually much higher velocity

and far fewer independent time series.

But the thing is like

for the particular thing we were doing in FinTech,

we had a need to track hundreds of thousands of different things.

So the same happened,

you know, with

the application server monitoring and metrics. I saw that it was kind of the same thing. I saw essentially that time series was an abstraction

that could be useful for solving problems in a number of different domains. One could be server monitoring,

one could be financial market data,

sensor data was starting to become a thing. And to me, like, sensor data

just looked like another variation of server data where it's like sensors is physical sensors,

server data is obviously software sensors.

And then the last category, I'd say, is real time analytics, which is basically just event data that you're tracking over time that you wanna ask questions about. But ultimately, the thing that's common is you're asking questions over time,

and the load profile, or how this data is shaped and what the usage profile looks like, is very, very different than a standard database workload. So in databases, you

have, like, OLTP databases, which are transactional databases. This is your typical relational SQL database.

Right? If you're gonna create banking software and you wanna track accounts, like having transactions and all this other stuff is super important. Then there are OLAP workloads, which are

analytics. This is analytical processing

and historically this has been the realm of like data warehouses, right, where you're writing in a ton of data

and you're running reports. So those reports can take a while to to run. You run them like once a day or once a week or whatever.

So the trend that was actually beginning to happen was that there's a need for more and more real time

OLAP style workloads. OLAP workloads being like, you're not updating records,

Right? It's not reference data for like what customers or whatever. This is all like

historical data of things. It's like events that you're writing in. So it's an append only workload.

So the real time aspect was basically the things that people wanted to do. They wanted to have a dashboard that they could show in real time and have it refresh once every few seconds,

and they wanted to have monitoring and alerting rules that automatically process this data as it came in or periodically in batch

to trigger alerts or to trigger automated systems and stuff like that. So,

the pivot

came in 2013,

basically in the fall of 2013,

and I thought, well,

you know, the SaaS application isn't taking off.

Maybe

we can refocus our efforts, you know, take the infrastructure we built,

use the same technologies, and at this point, I'd rewritten this, like, back end,

you know, monstrosity with, like, Scala and Cassandra and Redis. I'd rewritten it into, like, a single thing in Go that actually looks much more like a database at that point.

So, yeah, I thought we can take those same technologies, repackage it,

release it as an open source project because I thought if it's gonna be infrastructure that other developers are gonna use, they're going to want it to be open source so they can use it however they see fit and take it with them job to job.

Yeah. And then we named it InfluxDB and it kind of took off from there.

So in that story of how you started and how you pivoted, I imagine that that's some reflection on how you got involved in data management. But I'm wondering if you can just add any more color to, you know, whether your experience and focus on data management started before the company, or just any other background you wanna share on that point?

Yeah. So it definitely started before I started this company. As I mentioned, I started the company in 2012. I've been working in tech since '98 and working as a programmer since 2000 or 2001.

So the stuff I wrote initially was like business applications and stuff like that. I started getting more focused on like, data specific applications, like, when I went back to college

in the late aughts.

And at this point, I focused on kind of a few different areas. Right? 1 was machine learning, which obviously like

data management is a big piece of that, right? Half of the stuff you're doing as a machine learning person is data engineering.

And then the other pieces were

like database technologies and really search and information retrieval. I thought search and information retrieval was super interesting. Like I took a graduate level course on it, and we talked about, like, building indexers and latent semantic indexing and all this other stuff.

And actually in the summer

of 2007,

I interned at Google and built like a very specific

search application

there.

My career from that point kind of progressed from there, where some of it was the search stuff, a little bit of machine learning mixed in here and there, and some other companies and projects. And like I said, the fintech startup that I worked for in 2010 was definitely in the data realm.

The thesis for that entire startup was essentially to take a market,

specifically the bond trading market, which is known as an over the counter market and add more transparency to how things are priced in it. So it's basically like taking a bunch of data, mixing it up and cleaning it and making it presentable to users who can then,

you know, derive intelligence from it and stuff like that.

You mentioned that the initial pivot of Influx resulted in the open source database that has been around for a number of years now. It's fairly well known.

But I also know that since then, you've added a number of different layers. You've built out an entire vertically integrated stack for metrics management that you've labeled the TICK stack. I'm curious if there was any back reference to the OneTick system that you mentioned before or if that was just kind of an accidental similarity there, and just kind of the current focus of what you're building at InfluxData?

Yeah. So the company initially was called Errplane. But when we did the open source database, we called it InfluxDB.

And at that time,

my intention was really to just focus on the database part of it. Like, as the popularity

of that took off pretty quickly. And over the course of the first, like, 6 months of 2014,

what I saw was people were picking it up and

trying to solve certain problems with it, but they had a common set of needs.

Right? So the database part is essentially how do you store and query the data, particularly at scale for time series data?

But another was how do they collect the data?

How do they visualize the data?

And then how do they process the data so that they can do either ETL or, like, monitoring, alerting, and stuff like that. So essentially, I kind of like

bucketed these into 4 different key areas

that you would need to address if you were going to build

a time series application or really kind of any analytical application.

So I was the founding CEO of the company. We did Y Combinator in the winter of '13, and I raised the seed round of funding right after that.

And then in 2014, I went out to raise the next round of funding. At this stage, it's me and 2 other guys who are in New York and

my thought was like, I just wanna, like, add some additional seed funding so that we have more runway so that we can just continue building this open source thing, kind of see where it takes us. Right?

And as I was talking to

investors, both here on the East Coast and there in the Bay Area,

I found a few things. 1, people didn't wanna do additional seed capital,

and I didn't have anybody else I could go back to to say, like, hey, can you give us more runway?

But there were a few investors, particularly in the Bay Area that were interested in potentially doing a Series A.

And so I started

reworking my pitch to be something more of like a series a pitch. Very early, I thought it was gonna be like, okay, open source time series database company, and that's the pitch. Right? Like, MongoDB

in the fall of 2013 had just raised a big round of funding at, like, a $1.2 billion valuation.

I was like, you know, this is gonna be like MongoDB, but for, like, analytical databases and time series and stuff like that. And

what I found with the investor audience was that they all seem to have like open source database fatigue.

They had done too many investments into open source databases, so they're just like

over indexed on it.

So they're looking for something a bit more. And

the backdrop of this was that at the same time

we had this, you know, user base of people that were starting to adopt it that had these other problems.

And I looked around and I saw that, and I also saw what Elastic had done with the ELK Stack. Right? Elastic started out with Elasticsearch

and then Logstash and Kibana had been built

by independent people completely outside of the company, and then they went and acquired those projects and those people because they saw that their users

were getting value out of, like, the combination of those pieces together.

So

based on what I saw, I thought it was a pretty obvious move

to essentially do the same thing, but for time series specific applications.

So, my pitch in the series A

was basically

we're gonna have this whole stack that we're gonna build. Like, right now, we essentially have a prototype of a database.

Give me a stack full of cash so I can hire a bunch of developers, and we'll build this, like, platform. And it's gonna be 4 different pieces, 1 for collection, 1 for storage and query,

1 for visualization,

and 1 for processing.

And I'm gonna call it the TICK stack, because tick is actually a play on financial market data. An individual data point in the time series is called a tick, which I think is basically a play on ticker tape. It was back in the day when you had prices coming out, there was a little ticker tape and blah, blah, blah. So in finance,

when people build a database to store this data, they usually call it something like a ticker plant or something like that, and this is why OneTick,

that database in the financial market space is called OneTick because OneTick is a play on that. So I thought

tick stack would be a play on that. A tick is basically just a data point in a time series, and we're building a stack to deal with it. So it was a long way of saying

what the tick stack is.

So in terms of the overall

ecosystem of time series, you mentioned a few different categories and broad buckets for reasons somebody might want to collect and process time series. And there's

a huge market in databases and analysis

for time series and even more in recent years.

And I'm wondering if you can just give

your assessment

of where Influx sits in that overall space of time series database and time series analysis

and some of the core focus that you're orienting your business around for being able to solve for the overall position within that market?

So I guess for use cases,

there's use cases or specific areas, but you have, like, metrics data, which, you know, you find in server monitoring, in sensor data and stuff like that. Metrics are basically like

measurements taken at fixed intervals of time with some additional metadata that describe dimensions on which you'd like to slice and dice it. You have just raw event data, which could be analytics data. Metrics data can also be represented as raw event data. Right? It could be

the web page, it could be events happening in the real world, like, just raw event data where you have just a bunch of different things, and ultimately you have a time stamp at which the event was recorded or occurred.
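To make the distinction between raw events and metrics concrete, here is a minimal Python sketch that derives fixed-interval metrics from timestamped events. The event tuples, the `host` dimension, and the `summarize` helper are all hypothetical, chosen only for illustration; this is not how InfluxDB itself computes anything.

```python
from collections import defaultdict

# Hypothetical raw events: (unix_timestamp_seconds, host, response_time_ms).
events = [
    (1000, "web-1", 120.0),
    (1003, "web-1", 80.0),
    (1007, "web-2", 200.0),
    (1012, "web-1", 100.0),
]

def summarize(events, interval_s):
    """Group events into fixed windows per host; return count and mean per window."""
    buckets = defaultdict(list)
    for ts, host, value in events:
        window_start = ts - (ts % interval_s)  # align to the start of the window
        buckets[(window_start, host)].append(value)
    return {
        key: {"count": len(vals), "mean": sum(vals) / len(vals)}
        for key, vals in buckets.items()
    }

metrics = summarize(events, interval_s=10)
```

The raw events keep full precision, while `metrics` holds one summarized value per window and host, which is roughly the trade-off being described.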

So then that's useful in

real time analytics, like business intelligence, user analytics,

obviously,

server monitoring,

application performance monitoring,

sensor data.

Right? This is all still kind of in this real time space where you want to be able to answer a query in time to show a dashboard that, you know, repeatedly updates

in real time, or to do monitoring and alerting on it. Again, like, in a semi real time fashion, you know. Real time is kind of in air quotes because it's like, okay.

Do you need it within 10 seconds, 1 second, a 100 milliseconds,

10 microseconds?

It just all depends. Right? On the use case and what you need. And then

further out, like, but still in the analytical space is kind of like data warehousing,

data lakes, all that kind of stuff. So you had data warehouses, and then Hadoop tried to take over that and, like, completely failed. And then

it kind of got taken back by, you know, object storage for the data lake side of it, and then more data warehousing,

you know, in the form of like Snowflake and a bunch of others.

So,

in this space, traditionally, InfluxDB has been solidly in the metrics category, but also a little bit in the events.

And the thing is, this is kind of a limitation of the technology itself. My vision was always to create

a platform and a data store that could deal with just

raw event data and it could

induce like metrics or whatever on the fly from raw event data, but basically to be able to store full high precision data for any historical

whatever

in the real world or in virtual worlds or in the digital world or anything like

that. But really with a focus on answering queries very quickly. So not like

massive scale data processing like you get with large map reduce jobs and stuff like that, but more like I want an answer within seconds or

even better tens of milliseconds.

So

where we're kind of moving more and more towards is, one, improving the technology so we can get a broader array of what we can cover.

So if you take just like server monitoring,

observability

is like a problem space. Right?

Observability has like the holy trinity

of functionality

for it. Right? Which

is metrics, logs, and tracing. Right? It's basically the holy trinity. Honestly, like all these things are just different views on the same thing, which is what's going on in my stack.

Metrics are basically summarizations

of raw

data.

Right?

Logs are basically just semi-structured or unstructured

output of this, and tracing is basically structured,

absolute highest precision data. Like, if everybody had their way, the best thing to do would be to just store the tracing data and forget all the other stuff, to basically just induce

whatever you need from the tracing data as needed.

But the problem is that can be incredibly expensive to do at scale.

So,

basically, people split this up into 3 different areas and they kind of like,

you know, you have to have a different view on it depending on what you're doing.

What we're trying to do with Influx, and particularly with the next generation of the technology, is to make it so you can store tracing data, for example, or raw event data, and that the data store is useful for that while still being able to answer queries in real time.

In terms of the

focus of the technology, you mentioned that it is biased towards

being able to do quick analyses and quick answers,

smaller subsets of data than you might do with, you know, some of these petabyte data warehouses.

And

my experience

of being exposed to InfluxData has always been in, as you mentioned, this metrics use case of I wanna be able to just fire off a bunch of data about my systems and then analyze it and generate alerts.

And I'm curious

how that initial focus and initial exposure has

influenced the direction of the business. And as you're thinking about

reenvisioning and rebuilding the technology stack around

it, how your experience within that metric space is informing

the design and architectural decisions that you're making

and any kind of legacy aspects of the system that might be

complicating the efforts to expand its capabilities and expand the available use cases?

Yeah. So the I mean, the business has been totally informed by

what Influx was good at initially and kind of where it got adopted, right, because that's obviously where we built the business. I mean, we went down the path of least resistance,

and it's funny because

when I pivoted out of Errplane, out of server monitoring and real time metrics,

into,

you know, an open source time series platform and really just like an open source time series database,

it was because at that time, and it's still true now, I viewed server monitoring and real time metrics as a horribly crowded space that I didn't

have a particularly

strong

view on. Like, I wasn't the person to provide the best product in that space. But I did have a strong view on building a tool for developers

where they could use it to solve their problems. So that's why the platform focus, that's why the database focus, because it's like, I can't tell you what the great app looks like, but I can hopefully help you build a great app by giving you a tool to do so. Right?

And

the thing is like we did this pivot and the people who initially picked it up are people who basically wanted to roll their own monitoring stacks in larger companies usually. Right? So they pick up Influx as a tool, which essentially ended up pulling us back into this server monitoring metrics

space,

but in a different kind of way. And the thing is like, obviously,

like we need to figure out a way to pay our developers and continue to write open source code, so we had to figure out a way to turn it into a business, and we kind of just took that pathway.

But

at the same time, the API that I developed initially was kind of about this idea that it's not metrics that you're writing in. Really the goal was events. I wanna be able to write any sort of event data in and do a query on it later to gather, like, summary information if I want or the high precision raw information if I want. Now,

like, obviously, like, no technology is perfect. Sadly, neither is Influx.

And

my vision

for what I wanted the platform to be able to do was far greater than what we've been able to do along the way.

So

Metrics was the first thing that it was actually pretty decent at, but the thing is as people use it, you know, they get comfortable with this API that it has, this push API where you can just push events and stuff like that. And they wanna use it in more ways where you find events being a thing. So one of the problems

people come up against pretty quickly is this idea of this cardinality problem. Right? So most metric systems,

right, whether it's

Prometheus or whoever,

have this kind of like data model where you have a metric and you have tags and then you have value and timestamp.

Influx has this, but Influx's data model has a bit more than that because again, it's not just about individual numerical values, it's about tracking a bunch of other stuff. But basically the cardinality problem is

what you write into your tags. And these become things like dimensions on which you wanna slice and dice your data. But the way people want to write data in is like

tracing, for example.

They wanna write in the span ID

or the trace ID or whatever.

And what that means is as you write the data in,

each row you write in will have a unique value.

Right? And for the way metric systems are designed,

they basically completely fall over if you try to do something like that. And literally

every metrics database will tell you, like, don't do this.
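The cardinality problem described here can be sketched in a few lines of Python. This assumes, for illustration only, that a series is identified by its measurement name plus its full set of tag key-value pairs; the `series_key` and `cardinality` helpers and the sample points are hypothetical.

```python
def series_key(measurement, tags):
    """Identify a series by its measurement name plus its sorted tag set."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{measurement},{tag_part}"

def cardinality(points):
    """Count the distinct series that a batch of points would create."""
    return len({series_key(measurement, tags) for measurement, tags in points})

hosts = ["web-1", "web-2"]

# Tagging by host only: every point lands in one of two series.
bounded = [("requests", {"host": hosts[i % 2]}) for i in range(1000)]

# Adding a unique trace ID as a tag: every point creates its own series.
unbounded = [
    ("requests", {"host": hosts[i % 2], "trace_id": f"t{i}"})
    for i in range(1000)
]

bounded_cardinality = cardinality(bounded)
unbounded_cardinality = cardinality(unbounded)
```

With the same 1,000 points, the first batch yields 2 series and the second yields 1,000, which is why an index keyed on tag values grows without bound once a per-row identifier is used as a tag.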

Now

what's happened over the last, I'd say,

5 years or so is every single system has gotten better and better about the number of individual time series you can write in. And really the primary driver of this initially was not because they wanted to be able to host tracing data,

but because

they wanted to be able to track time series data in containers.

And the problem is containers can be short lived. They're ephemeral. So what this means is that over a long enough period of time, the number of unique time series you're tracking goes up and up and up. But you're not necessarily querying that data. So there are a number of, like, hacks that the metric systems, including Influx, have done to make it so that you can have ephemeral time series,

so that if taken over

the entire span of time, you have really high cardinality, but within individual blocks of time, you don't necessarily have that. But still,

that limit is there and when

I see our users trying to use the system,

they don't wanna have to think about,

you know, is this value I'm writing in going to ruin my database? Is it gonna make it so I can't query the data back out? Is it gonna make my database explode? Right? That's the primary limitation that we wanna lift. And, you know, I keep mentioning tracing as an example, but

if, you know, the next generation of InfluxDB doesn't get used for tracing at all, I don't really care. Like, it would be great if it is useful for tracing.

What matters to me more is providing a developer experience where people could just write the data in, you know, not have to do a bunch of upfront schema design, which is kind of what Influx is about. It's about write the data in, it's schema on write, you know,

and you can query it. The goal here is the lowest possible friction you can have for getting something up and running.

And what I think will be interesting is

without people having to worry about cardinality and without them having to think about, like, what is a tag, what is a field, like, Influx-specific stuff,

what kind of applications they would create on it, what they would do with it. And again, like, all of this is obviously, like, within the scope of this

kind of

time series or historical

data.

You're tracking

real world observations or theoretical observations or whatever over time, and you're running queries on that later or you're monitoring it as it comes in.

We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do?

Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time?

Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more.

Go to dataengineeringpodcast.com/census today to get a free 14 day trial.

In terms of the actual

architectural aspects of InfluxDB,

and I'm interested in digging more into the data modeling question later as well. But the database itself, I'm wondering if you can talk through some of the

design and architectural decisions that you made given the priorities that you had as you were developing it and the focus on

the developer experience

of using the database

and just some of the ways that the

implementation

and design and goals have changed since you first began working on it as you continue to work with customers who were pushing it more into this metrics direction?

So in terms of, like, the core database technology,

there are, I think, basically, 4 major

periods of time in Influx's lifetime. The fourth being one that actually

isn't being used right now. It's one that is in active development.

Right? It's public, but we're not producing builds or anything like that. So the first 3. So the first phase was essentially

the very initial versions of Influx

and it basically had a slightly different data model than what exists now,

but there was still this model of there's an HTTP API

where you can write data in and it'll just create the schema on the fly as you go. And this was more about like, you have like tables with columns and stuff like that. And

what I saw in how users were using it was

if they knew exactly what they're doing, they could get decent performance on the query side, but there were many instances

where they would get abysmal performance on the query side, right, which

would necessitate the need for secondary indexing, which would then necessitate the need for users to define indexes

and all of this other kind of stuff, which obviously exist in normal relational databases.

But

I had this view that, like, time series was a thing where you could actually

take a bunch of the friction out and kind of make decisions for the user that would just kind of, like, lead them down the way down the path to getting reasonable performance, assuming they're not operating at the edges. Right? At the edges, like everything's gonna break. So you're gonna have to put extra effort into it anyway.

So

that was like the first InfluxDB data model, which was basically all versions up to 0.8.

So then the next version, 0.9,

was where I introduced the data model that is what exists today. Right? You have measurements, you have tags, you have fields,

and a timestamp. And the tags are basically just key value pairs where the values are strings. Right? Those are dimensions

on which you're slicing and dicing your data, and then the fields are actually the raw values, but the difference here is that the values can be a float, an int, a bool, or a string. So it's not just constrained to numerical values.
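As a rough illustration of that data model, the Python sketch below serializes a single point (measurement, tags, fields, timestamp) into a text line loosely modeled on InfluxDB's line protocol. It is a simplification under stated assumptions: the `to_line` and `format_field` helpers are hypothetical, the sample values are made up, and escaping of spaces, commas, and quotes is omitted entirely.

```python
def format_field(value):
    """Render a field value as a float, int, bool, or string literal."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    return f'"{value}"'  # strings are double-quoted (no escaping in this sketch)

def to_line(measurement, tags, fields, timestamp_ns):
    """Build 'measurement,tag=value,... field=value,... timestamp'."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"

line = to_line(
    "cpu",
    {"region": "us-west", "host": "server01"},  # tags: string dimensions
    {"usage_idle": 92.5, "cores": 4, "throttled": False, "state": "ok"},
    1465839830100400200,
)
```

Tag values are always strings and describe the dimensions to filter and group on, while fields carry the typed values being recorded.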

And the idea for that model was that

this kind of

structure would make it so that we could create the database essentially as

1 indexing layer and then the raw time series data. So the indexing layer is essentially an inverted index that operates on the measurement names and the tags

and normally you use an inverted index in document search, where you map

a term in a document to the list of document IDs that it appears in. In this case,

the inverted index maps a tag key value pair to the list of time series that match that criteria.

Right? And then the raw time series data.
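The inverted index idea can be sketched in Python as follows. This is a toy in-memory version under assumed names: the `SeriesIndex` class, the `_measurement` pseudo-tag used to index measurement names, and the integer series IDs are all hypothetical, and the real index structures are considerably more involved.

```python
from collections import defaultdict

class SeriesIndex:
    """Toy inverted index: (tag key, tag value) -> set of series IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, series_id, measurement, tags):
        # Index the measurement name under a reserved pseudo-tag (an assumption).
        self.postings[("_measurement", measurement)].add(series_id)
        for key, value in tags.items():
            self.postings[(key, value)].add(series_id)

    def lookup(self, measurement, **tag_filters):
        """Return IDs of series matching the measurement and every tag filter."""
        result = set(self.postings.get(("_measurement", measurement), set()))
        for key, value in tag_filters.items():
            result &= self.postings.get((key, value), set())
        return sorted(result)

index = SeriesIndex()
index.add(1, "cpu", {"host": "a", "region": "us-west"})
index.add(2, "cpu", {"host": "b", "region": "us-west"})
index.add(3, "mem", {"host": "a", "region": "us-east"})

west_cpu = index.lookup("cpu", region="us-west")
host_a_cpu = index.lookup("cpu", host="a")
```

A query first resolves its tag predicates to a set of series IDs through the index, and only then reads the raw time series data for those IDs.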

So that was version 0.9,

and basically version 0.9 up through 0.11

was time that we spent trying to tune this structure. So what we found was that the data model was something that made a lot of sense to people, but we couldn't get the performance characteristics of it right. Like, we were trying to use a particular storage engine at that point, and a bunch of stuff wasn't working. So

the next phase, which was basically like well, I guess this is still kind of like in phase 2. So we essentially created our own

storage engine to store the time series data.

This was called TSM. So at this point, what we had was we had all the time series data which was stored on files and then the inverted index

we had in memory. So basically queries could come in, you consult the inverted index in memory, and you map it to the raw underlying time series data and you do a query. Great.

The problem that created was that this metadata index, which again is like all those tags,

resides in memory. So

what that means is if you have a lot of ephemeral series, if you have high cardinality, it kinda blows up your memory footprint over time. So our next major change that we made

was we created

an indexing structure

for this metadata

that would be on disk and memory mapped.

So that was something called TSI.

That's where we are now, is we have this structure where we have an inverted index that's memory mapped for this metadata and then the raw time series data. And this works really, really well for

metric style data and a number of other open source projects use the same

kind of structure.

This doesn't work well

for

situations

where you have high cardinality.

Right? It doesn't work well when you have many, many time series

or when you have

many individual events,

which on their own probably wouldn't be a time series, but you'd want to construct time series on the fly

based on different ways you can view the data.

So

that's what this 4th phase is really about,

which the 4th phase is about a number of other things as well, but we're essentially

rearchitecting the core. So basically from 0.8

to 0.9, we rearchitected the API.

From 0.9 to 0.11, we basically spent all that time, like, figuring out the core of the storage engine.

And then up until I think like 1.3,

0.11 to 1.3, which was like 5 or 6 releases,

we were just tuning that, working on that, and then we introduced the TSI. So the TSI stuff was really just additive to what we were already doing with the storage engine. Once we hit 0.11,

everything from that point on was basically additive changes to that existing core technology.

This

new phase that we're in, we're basically completely reworking

how the internals of the database are structured. We no longer have this like inverted index and time series store. Right? In the new phase, it's a columnar database.

So it looks like a columnar database, an in memory slash on disk columnar database.

And it's actually in a completely different language. So

InfluxDB,

including the current version 2.x,

is written in Go and it has been that entire time.

This new core of the database is actually written in Rust.

So there's been a lot of evolution over time

based on things that we've seen people trying to do with the database

and just basically other developments

completely outside of Influx inside the industry.

And I'm sure that at least a decent portion of the motivation for using Rust for this new core is because of the guaranteed memory safety and some of the mathematical correctness

of the language runtime itself. But I'm curious if there were any particular

issues or complexities

or error cases that you ran into with Go that pushed you into considering a new language for this new core of the engine?

So, I mean, obviously, there's the garbage collection stuff. Like, we wanted to have more fine grain control over memory.

One of the other things is, like, the compiler in Rust prevents you from writing data races.

Right? Which has been a source of bugs over time in Influx. And the problem is those bugs are the worst possible bugs because they usually only occur when you're under load. They're really hard to reproduce.

They're really hard to track down.

Right?

So Rust, basically, the compiler just doesn't let you do that, which is awesome. The thing is, like, Go has stuff in the language to help you write

multi threaded code and all this other stuff, right, the channels and whatever,

but

frequently you don't end up using channels. You actually just use the sync package and stuff like that. And there's no guarantees there. Whereas like Rust, the compiler won't let you do, you know,

share stuff across threads in an unsafe or unpredictable way. Like, it just won't let you do it, which is great. There were some other, like, needs I had or at least I thought I would have

earlier on

that attracted me to Rust.

I thought I may need to bring in a bunch of C and C++ libraries,

and Rust gives you a zero-cost abstraction to do that. Right? Whereas with Go, you have the cgo interface that you have to go through.

We wanted to make certain parts of this project embeddable in other languages,

like particularly in other, like, big data platforms. Right? Like a bunch of the stream processing platforms and stuff like that are written in Java. If you have something written in Go, there's no way to cross that

divide other than the horrible serialization methods.

Whereas,

if it's written in Rust, you can create a c interface around it and you can embed that in Java or any other language that can embed C code,

which obviously we haven't done yet, but I just liked that as a possibility.

But ultimately, I think

Rust as a language

has a number of constructs that make it easier to write correct code.

Right? No matter what, when you write code, you're gonna create bugs. That's how it is. But

it feels like there are certain classes of bugs. There are certain things that just won't happen in Rust where they are more likely to happen in other languages.

And another large element of the InfluxDB

system is the horizontal scalability

aspect, which is something that you focused on from the initial implementation.

And it also

came to be,

from what I saw, a bit of a point of contention as you progressed in the journey of the product of whether or not that horizontal scalability was part of the open source versus the paid package.

And so before we get into that aspect and some of the open source strategy and your thoughts there, I'm curious just in the technical aspects of how you manage the horizontal scale out of InfluxDB

and where it falls in terms of

CAP theorem?

So it depends on which InfluxDB product you're talking about. I mean, if it's the open source InfluxDB, that's a single server. So there's nothing there. If it's the enterprise product of InfluxDB,

there are 2 different types of data within it. One is basically metadata about what databases exist, what users, all this other stuff. That is a CP system. It's

a Raft protocol based system that makes sure that it's all there.

The actual time series data is more of an AP system.

AP is basically how

our Cloud 2 product and how IOx, the new InfluxDB

are designed. Right? They're designed more around

AP

in terms of those things. Like,

if you want consistency,

you can layer it in

above it, but the core of

the different pieces working together is about getting the data in

and then having an AP system. But ideally, if you want to, you're able to query it out and ask for a consistent read. The thing is consistent reads are always more expensive. And for many of these use cases, people don't wanna pay that price.

So

forcing people into a CP system where an AP system will do,

it's kind of painful.

To your point of being able

to add the consistency aspect on the software layer that is interacting with the database, I'm wondering if you can dig into how the

Telegraf

implementation

and the other layers of the TICK stack

are able to offer some of those consistency guarantees at least in a tunable variety for people who are using

them? Well, so Telegraf is really just a data shipper. Right? It's like a connector. It reads data or gets sent data and it sends it off to Influx.

There are no consistency guarantees on that, right? Like it gets the data and it sends it. It will retry things that may have failed.

Individual writes to Influx

are idempotent

if the writer is actually supplying the timestamp for that data,

which is an important

aspect

of Influx.

Otherwise, having an AP system wouldn't, I think, really even work for us.
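A minimal Python sketch of why writer-supplied timestamps make retries safe: if a value is stored under its series key and timestamp, replaying the same write overwrites the same slot instead of adding a new point. The `SeriesStore` class and its counter-based "server clock" are hypothetical stand-ins, not how InfluxDB is implemented.

```python
import itertools
from typing import Dict, Optional, Tuple


class SeriesStore:
    """Hypothetical store keyed by (series key, timestamp)."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, int], float] = {}
        # Stand-in for a server-side clock, used when the writer sends no timestamp.
        self._server_clock = itertools.count(1)

    def write(self, series_key: str, value: float,
              timestamp_ns: Optional[int] = None) -> None:
        if timestamp_ns is None:
            timestamp_ns = next(self._server_clock)
        # Same series and timestamp lands on the same key, so a retry is a no-op.
        self._values[(series_key, timestamp_ns)] = value

    def __len__(self) -> int:
        return len(self._values)


if __name__ == "__main__":
    with_ts = SeriesStore()
    for _ in range(3):  # the original write plus two retries
        with_ts.write("cpu,host=a", 0.5, timestamp_ns=1_000)
    print(len(with_ts))

    without_ts = SeriesStore()
    for _ in range(3):  # each retry gets a fresh server-assigned time
        without_ts.write("cpu,host=a", 0.5)
    print(len(without_ts))
```

With the writer's timestamp the three attempts collapse into one point, while without it every retry becomes a separate point.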

So

as far as the query layer goes,

again, it kind of depends on which Influx

product we're talking about. So, for example, our current cloud product, Cloud 2 or just InfluxDB Cloud is essentially what it's called.

When you write data into the platform,

it does authentication and authorization

at the front door. Right? So that is consistent.

And then it pipes the data into Kafka. We use Kafka as a distributed write-ahead

log, and then the storage servers pick up that data. And then the query servers, when they get queries, will hit the storage servers.

They will read from the storage server that has whatever the latest Kafka offset is.

So I guess that 1 is consistent because Kafka gives us that, but only if you're actually

continuously hitting the forward Kafka offset.

And then

going back to the data modeling question, you were saying how in

the more recent versions, your goal is to be able to

defer some of the

upfront decision making that developers and end users have to think about. And I'm wondering if you can just talk through some of the data modeling considerations

or, you know, potential footguns that people might run into as they are starting to use Influx and ship data off and how that might impact their ability to query and aggregate the information that they're sending?

So

within Influx, you have measurements.

Within a measurement, you have tags and fields. Like I said, tags are indexed,

fields are not.

So if you're running a query where

a where clause

occurs in your query and that where clause is something based on a tag, it should be able to find it pretty quickly because it hits the index and does whatever. If the thing in the where clause is based on a field,

it will have to do a scan over whatever the time range is that you're looking at. And the thing is, like, depending on what you're doing, that could be totally fine. But

right now,

when you think about laying out your data, you have to think about what should be a measurement, what should be a tag, and what should be a field.
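The difference between filtering on a tag and filtering on a field can be sketched in Python as follows. A tag filter goes through an index and only touches the matching series, while a field filter has to look at every row in the time range. All names here (`query_by_tag`, `query_by_field`, the sample data) are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, float]           # (timestamp, field value)
Hit = Tuple[str, int, float]      # (series key, timestamp, field value)


def query_by_tag(tag_index: Dict[Tuple[str, str], List[str]],
                 series_rows: Dict[str, List[Row]],
                 tag: Tuple[str, str], start: int, stop: int) -> Tuple[List[Hit], int]:
    """Use the index to visit only the series that carry the tag pair."""
    examined, hits = 0, []
    for key in tag_index.get(tag, []):
        for ts, value in series_rows[key]:
            if start <= ts < stop:
                examined += 1
                hits.append((key, ts, value))
    return hits, examined


def query_by_field(series_rows: Dict[str, List[Row]],
                   predicate: Callable[[float], bool],
                   start: int, stop: int) -> Tuple[List[Hit], int]:
    """No index on field values, so scan every row in the time range."""
    examined, hits = 0, []
    for key, rows in series_rows.items():
        for ts, value in rows:
            if start <= ts < stop:
                examined += 1
                if predicate(value):
                    hits.append((key, ts, value))
    return hits, examined


if __name__ == "__main__":
    rows = {
        "cpu,host=a": [(1, 0.2), (2, 0.9), (3, 0.4)],
        "cpu,host=b": [(1, 0.1), (2, 0.3), (3, 0.95)],
    }
    index = {("host", "a"): ["cpu,host=a"], ("host", "b"): ["cpu,host=b"]}
    print(query_by_tag(index, rows, ("host", "a"), 0, 10))
    print(query_by_field(rows, lambda v: v > 0.8, 0, 10))
```

In this toy data the tag query examines 3 rows and the field query examines all 6, which is the scan cost described above.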

The thing

I'd like to remove from people's consideration

is whether or not something should be a tag or a field.

Basically, measurements are a good way to organize data.

In IOx, it's basically a measurement is a table. And conceptually, I think this is how our users

view it, which is a measurement is a table,

tags and fields are columns

where it just so happens that tags are columns that have

predefined indexing on them.

Basically, making it so that you don't have to worry about what's a tag or a field, and you can just write your data in, and you don't have to worry about whether or not you're,

you know, creating some cardinality explosion or whatever. That's the primary thing that I think I'd like to make it so people don't have to worry about.

But for now,

the system that's in place, you do have to consider what's gonna be a tag and what's gonna be a field.

And if it's something where you have, you know, unique values going in, at this stage, that has to be a field. Otherwise, the system's not gonna work.

And another interesting aspect

and an early design decision that you made was to build your own query language for InfluxDB. And I'm curious how that has played out in terms of the overall adoption and the user feedback

and any sort of challenges and friction that that creates in terms of integrating with InfluxDB and just what you drew on for inspiration of the language and your motivation for creating your own query syntax versus just using SQL out of the box.

SQL out of the box isn't really an actual thing given the variety of dialects, but sort of you get what I mean. Right.

We had InfluxQL.

The thing is InfluxQL

looks kind of like SQL, but it's not SQL. And for people who are familiar with SQL,

it's not SQL in ways that become more and more frustrating over time as you work with it. Right? To create like a just a really trivial

query, it's super, super easy and I would say in many cases, it's easier than

what raw SQL looks like.

But

there are many ways in which it's not SQL, which if you start to do more advanced things, and actually if you're somebody who's just like really, really familiar with SQL, you'd probably be annoyed by it right away, right, with InfluxQL

just because it is different, although it looks

not quite different.

The thing is like as a language, 1,

there are ways in which it was different from SQL that we knew were frustrating to some users

And there were limitations

in the language and then the actual execution engine

where

there were all sorts of feature requests we had that we wanted to add in, but we couldn't figure out

how to add those into the language and have it make

sense, like, semantically or syntactically or whatever.

So we arrived at a point where we're like, okay, we need to do something

drastic with the language if we wanna, like, enable these new features. We couldn't just, like, add on to this thing. Right?

It was basically like we didn't have a solid enough foundation

to add to.

So the debate at that point was, do we adopt a more standards compliant SQL

flavor? Right? To me, like the dialect you're gonna adopt is either MySQL or Postgres

or

do we create,

you know, something new?

And

my view was that for time series data

and this kind of like event data,

I thought a functional style,

it made a bit more sense to me, which is basically like you have time series or you have the stream of data and you apply a function to it, which gets piped out to another function, which gets piped out to another function. Right? You just have like this series of transformations that happen until you get to data on the other end,

which I thought like made a lot of sense.
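The functional style described here, where a stream of data flows through a chain of transformations, can be sketched in Python. This is not Flux syntax. The `pipe`, `time_range`, `keep`, and `mean_value` helpers are hypothetical names chosen for this example.

```python
from functools import reduce
from typing import Callable, Iterable, Optional, Tuple

Point = Tuple[int, float]  # (timestamp, value)


def pipe(source, *stages: Callable):
    """Feed the source through each stage in order, left to right."""
    return reduce(lambda data, stage: stage(data), stages, source)


def time_range(start: int, stop: int) -> Callable[[Iterable[Point]], Iterable[Point]]:
    return lambda points: (p for p in points if start <= p[0] < stop)


def keep(predicate: Callable[[Point], bool]) -> Callable[[Iterable[Point]], Iterable[Point]]:
    return lambda points: (p for p in points if predicate(p))


def mean_value(points: Iterable[Point]) -> Optional[float]:
    values = [value for _, value in points]
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    data = [(0, 1.0), (10, 3.0), (20, 5.0), (30, 100.0)]
    result = pipe(
        data,
        time_range(0, 30),            # drop the point at t=30
        keep(lambda p: p[1] > 1.0),   # drop the point with value 1.0
        mean_value,                   # aggregate what is left
    )
    print(result)
```

Each stage only knows about its input and output, so new transformations can be added to the chain without touching the others.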

So, there was 1 side, which is I kinda had a preference for doing something like that. And then the other side of it was really just kind of a pragmatic view of,

okay,

we have this system written in Go,

if we're gonna add a legitimate like SQL processing layer to it,

what does that work look like? Right? At that time there weren't, that I knew of, any

MySQL or Postgres parsers or query engines

written in Go that we could use.

Right? So basically at that point we're talking about writing

a query parser,

a planner,

an optimizer,

and an execution engine,

which

for a legitimate SQL implementation is gonna take you 5 plus years. Right? There's just no way of getting around the fact that that's a difficult thing to do and it takes time.

So

picking that didn't feel like a faster way to go. My thought was like, if we do our own language, we can kind of limit the scope.

So hopefully we can get something done faster.

But also at the same time,

I wanted a language that was actually a full blown scripting language. Like I wanted to execute code within the database because I wanted users to be able to

specify

behavior and functionality

that was well outside the realm of a declarative

query language,

which is also both good and bad. The nice thing about a declarative query language like SQL is you can write planners and optimizers that do a bunch of magic for you. Right? And those are, like, provable and all this other stuff. But again,

the job of writing query optimizers is, like, one that's never done, and there are literally, like, a million different ways to do it. Whereas, if you have a scripting language,

basically anybody can write code in any of a million different ways, which could have different performance impacts and stuff like that. But

that's what I wanted was a language

where

they could specify more than just declarative queries that could be executed within the database.

So we landed on we're gonna do a functional language. As far as what it looks like, it looks basically like JavaScript.

The only difference is the pipe forward operator, which we borrowed from Elixir.

I mean, there are proposals to add pipe forward to JavaScript, but I don't think that that's ever gonna happen, but so yeah. It looks like JavaScript because we wanted it to be something that looks a bit more familiar that people could pick up easily.

That experience has been

both a learning experience because that ended up being

and still is

way, way more work,

which in again, in retrospect should have been obvious, but it's way more work than, you know, any of us anticipated it would be. And that just is continuing.

But the thing is, what we found is that some users love it. Like the functionality that they've been able to get out of it, that they couldn't get out of InfluxQL

has been huge

and that, you know, they've written literally tens of thousands of lines of code in this.

Some users hate it, won't adopt it, won't use it. Right?

So the learning experience we've had there is

where we're at now is what we want to do is we want to bring in more options for people to query their data and to process their data in languages

that they want to use. So Flux is one way, right? It's the way we have right now

in our cloud platform to query and process your data. We also now support InfluxQL in our cloud platform,

but

we're adding SQL support. So SQL

gets added,

I think, in a

week or two

is when that launches.

So

we will have support for, you know, querying your data via SQL.

We are also going to be working on adding support for other programming languages, specifically

Python

and TypeScript

is what we'd like to add support for. Because really like what the platform is about

is about storing and querying your data, you know, at scale, but also processing

that data for monitoring, alerting, and ETL and basically tying all of those pieces together.

So if we can enable people to pick up programming languages and query tools that they're comfortable and familiar with

and just make it easier for them to do that, then I think that'll be good.

A lot of interesting things in there. We're starting to get towards the end, so I'm not gonna dig too deep into them. One of the other things that I wanted to

ask about and spend some time on is your thoughts on the overall

role and benefits

of open source in your product strategy

and just your reflections

on

the

successes that you've been able to realize because of that bottom up developer adoption that the open source Influx engine has provided to you and just the network effects of the other components of the TICK stack, particularly with Telegraf not being tied specifically to InfluxDB and just being another entry point for people to be able to discover your other products?

I mean, open source is a core part of our product strategy, obviously.

Telegraf is interesting because it's the most successful open source project we have by a decent margin. And

we made the decision early on that it was not going to be tied specifically to InfluxDB.

And this was again like a pragmatic decision because at the time when we created Telegraf, which was first released in June of 2015,

there were already a number of open source data collectors that were popular and basically the reaction from many people was why the hell would you do this? Why do we need another data collector?

And our thesis was that we wanted something that was purpose built for InfluxDB that took advantage of InfluxDB's

data model and could work seamlessly with it. But we realized as a data collector,

all of the value in a data collector lies in the plugins that exist for it to pull data from other systems.

And the only way we were gonna get a decent number of plugins was if we built a thriving community around it. So we said we're not gonna tie it to InfluxDB specifically. It's gonna be

useful on its own as an independent project and that will hopefully drive more people to contribute to it and whatever. There are some other things in the creation of that project, like how the code was structured

in a way to make it really easy for an outsider to come in and contribute a plugin without having to know the entirety of the project. But basically that played out really well over, you know, 6 years, which is, you know,

thousands of people have contributed to Telegraf including competitors

and that has increased the value of Telegraf as a whole.

Now,

I will say

like maybe people discover InfluxDB as a result of it. There are certainly people who use InfluxDB because Telegraf is such a good data collector. But again, like it's a good data collector because it has such a big community contributing plugins and stuff like that. So it's not necessarily that I view it as, like, a great marketing channel

as much as I view it as just a great piece of technology that InfluxDB works well with.

As far as like a bunch of the other stuff, like

open source for Influx has been a tricky road because, you know, initially the vision was that the open source project was gonna be distributed horizontally scalable,

totally open source.

And in 2016,

I changed that because

we had to figure out how to build a business

and I couldn't see a way

to build 1 because we had said, like, 8 months prior that we would do support and stuff like that, and nobody wanted to pay for it. We had a basic hosting platform, but again, that wasn't really taking off at that point.

So we needed something differentiated that people would actually pay us money for. So we ended up saying, we're going to hold out scale-out

in the commercial version

and try and build a business that way. You know, obviously people were upset about that change, but we also had immediate commercial interest

and that is what most of our business is built on at this point.

Going forward, like, we're always looking at how do we add more to the open source. And really, I'd say this 4th

phase of InfluxDB is really about that. It's about 2 things actually in terms of the core database itself.

1,

adding more distributed capabilities to the open source project

and 2,

integrating more seamlessly and more tightly with

other open source technologies and tools. Right? So, its query language

natively is the Postgres dialect of SQL,

which is actually built on top of a project called DataFusion, which is part of the Apache Arrow project.

Persistence format is no longer specific to Influx. It's Parquet,

which is again now part of the Apache Arrow project.

So we're really just working on building,

you know, this core as something that integrates seamlessly with other things in the OLAP and data warehousing workspace

and builds on standards that people already know and love.

Yeah.

Adopting Arrow and Parquet as core elements of the database engine, I can definitely envision a huge number of different network effects and technological benefits and sort of architectural patterns that can evolve out of that. So definitely interested to see where that goes in sort of the future releases. I'm definitely gonna be keeping a close eye on that.

In terms of

just the

overall adoption and usage of Influx and the overall TICK stack and your platforms that you've built around all of that. I'm wondering what you have seen as some of the most interesting or innovative or unexpected ways that it's all been deployed and used?

The database itself gets used in all sorts of interesting ways, but I kind of, like, hoped and expected that all along the way, like, you know,

obvious use cases like server monitoring, but more interesting use cases like monitoring sensor data of all different kinds, right, whether it's like, wind turbines or solar panels or

race cars

have been

monitored with that. 1 of the more interesting ones that I saw recently was, like, some observatory in South America that's

tracking a bunch of data on it. I've seen other use cases as well, so that's interesting.

Telegraf has been super interesting to watch.

It's become more than just like a collection agent. It's become this thing where people, like, proxy data through it, and it becomes, like,

this other piece of infrastructure

that they run in

addition to Influx or sometimes obviously even without Influx.

And I see Telegraf being used kind of like at the edge in interesting ways,

which I'd like to bring in as a more first class concept in the future versions of InfluxDB.

And in your own experience

of building out these products and driving a lot of the technology behind it and building a business on top of all of these systems, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?

I mean, this is like a lesson you learn all the time. Basically, everything takes longer than you expect it to take.

Like, it just depends on what your personal bent is. Being an entrepreneur, I'm an optimist

in many ways. So

I have been surprised by

some cases where there are technologies that I did not think would take off that did, or other things that

I thought would take off that didn't. The

thing you don't expect as you're, like, coming up with this new technology, or at least I didn't expect as I'm coming up with this, like, new version of the database,

you know, new query languages or whatever,

is you don't

account for all the other things that are completely outside of your control. They're basically out in the industry. They're going to change around you. And this is why like

development time and how long it takes you to get something to market is so critical. It's because if you take too long,

things will change around you in ways that you didn't actually anticipate when you were first coming up with this plan. So by the time you get it out there, it could be irrelevant

or the ground could have shifted beneath you and stuff like this. I saw this John Carmack, I don't normally watch Joe Rogan, but I did watch this one interview with John Carmack,

you know, a guy who started id, the creator of Doom and Quake, and a bunch of other things.

And he had mentioned at some point late in the interview how, like, you know, his last project at id was this one that they, like, dragged on for, like, 6 years or something like that. And the thing that surprised him was that, like, the technology

of these games and stuff like that shifted beneath them and actually advanced beyond them

while they were too busy trying to create this thing. And for some of the things that we've taken a bit longer to release than I would have liked,

I've seen developments

by other players in the industry that were basically like they just came out faster than I expected them to.

And in terms of people who are looking at time series use cases, they're considering InfluxDB or some of the associated tools that you've built around it. What are the cases where Influx and the Influx platform are the wrong choice?

The wrong choice.

Well, I definitely wouldn't use it to store tracing data or, like, really high cardinality event data right now. It's not good for

that.

Well, it's not good for super big data use cases, like, you know, data use cases where you have, like, petabytes of data and stuff like that. Not good, not useful.

There are a number of larger scale, like,

complex analytical workloads that it's currently not

that good at, but that's the 1 that I'm really hoping to change

in the nearer term.

And I guess right now, like, I would say log data

you can actually store log data in Influx. And depending on what you're doing, it could be very, very good. We actually have users who've used it extensively for log data. But again, like, there are probably other things that are better out there now than Influx.

One of the things that we didn't touch on yet that I wanted to discuss briefly is just some of the life cycle elements of the data in an InfluxDB system because of the fact that you are dealing with time series data. It's always changing. It's always progressing forward.

How do you deal with historical data and either aging it out of the system or doing periodic compaction where you might say, I'm storing 1 second ticks

for 1 month or for 1 week. And then at that 1 week mark, I'm going to roll that up into 10 minute intervals or 1 week intervals or things like that.

Yeah. So the current versions of Influx, 1.x and 2.x,

essentially have it so that when you create a database, you can set a retention policy to say how long to keep the data around. And that works

basically in large blocks

or whatever. We call them shards in the underlying storage engine. So essentially you can say, I wanna keep this data around in this database for 7 days or 30 days or whatever.

And essentially, like,

it organizes data into shards, which later

you can just drop a bunch of files. Right? You don't have to, like, rewrite a bunch of things or do compactions in order to get rid of that old data.
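The retention idea can be sketched in Python: if points are grouped into time-bucketed shards, enforcing a retention policy means deleting whole shards rather than rewriting data. The one-day shard width, the `ShardedStore` class, and its method names are assumptions for this sketch, not InfluxDB's actual shard layout.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SHARD_SECONDS = 24 * 60 * 60  # assumed shard width of one day


class ShardedStore:
    """Hypothetical store that groups points into fixed-width time shards."""

    def __init__(self, retention_seconds: int) -> None:
        self.retention_seconds = retention_seconds
        # shard id -> list of (timestamp in seconds, value)
        self.shards: Dict[int, List[Tuple[int, float]]] = defaultdict(list)

    def write(self, timestamp: int, value: float) -> None:
        self.shards[timestamp // SHARD_SECONDS].append((timestamp, value))

    def enforce_retention(self, now: int) -> List[int]:
        """Drop every shard whose whole time range is older than the cutoff."""
        cutoff = now - self.retention_seconds
        expired = [sid for sid in self.shards
                   if (sid + 1) * SHARD_SECONDS <= cutoff]
        for sid in expired:
            del self.shards[sid]  # analogous to deleting the shard's files
        return sorted(expired)


if __name__ == "__main__":
    store = ShardedStore(retention_seconds=7 * SHARD_SECONDS)
    for day in (0, 3, 9):
        store.write(day * SHARD_SECONDS + 5, 1.0)
    print(store.enforce_retention(now=10 * SHARD_SECONDS))
    print(sorted(store.shards))
```

A shard is only dropped once its entire range has aged out, so data can live slightly longer than the policy but is never removed early.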

For,

you know, summarization

and downsampling and stuff like that, in 1.x you have something called continuous queries.

In 2.0, you have Flux

tasks.

So you can basically write downsampling rules as Flux tasks which write that data into

a separate database or separate bucket, is what it's called, that could have a different retention policy.
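As a rough picture of what a downsampling rule does, here is a Python sketch that rolls raw points up into fixed windows by taking the mean of each window. It is not a continuous query or a Flux task. The `downsample` function and the choice of mean as the aggregate are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Point = Tuple[int, float]  # (timestamp in seconds, value)


def downsample(points: Iterable[Point], window_seconds: int) -> List[Point]:
    """Group points by window start time and emit one mean value per window."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in points:
        window_start = timestamp - timestamp % window_seconds
        buckets[window_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


if __name__ == "__main__":
    raw = [(0, 1.0), (1, 3.0), (600, 10.0), (601, 20.0)]
    # Roll 1 second ticks up into 10 minute windows.
    print(downsample(raw, window_seconds=600))
```

The rolled-up points would then be written to a separate bucket with a longer retention policy than the raw data.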

In IOx, InfluxDB IOx, which is the new like core of the database,

it actually

has a life cycle that's tied to object storage. So the goal there is

this project manages the data life cycle of ingest where you have it in memory,

batching it up together to ship it off to object storage in larger blocks and evicting it from memory, but then also potentially pulling it back from object storage

and using the locally attached disk as essentially like a cache

for object storage.

So managing that data life cycle of what's in memory, what's in object store, what's on the local disk is basically like

a big part of what that project is all about. So that ideally you can have

petabytes of data,

but not have to worry about those petabytes of data being stored in InfluxDB

IOx. They're obviously all in object storage. So InfluxDB IOx can focus on the parts that it really cares about, which is managing that data life cycle to make sure all data gets shipped to object storage eventually,

answering queries in real time for dashboards and monitoring and alerting.

And then

if you wanna do large scale data processing,

because it's all just Parquet files in object storage,

you can have a big data processing system like EMR or whatever, and just point it at object storage and do that totally out of band of the, you know, production system.

So that's where things are going.

And I know we've already discussed a number of different plans that you have for the near to medium term with this rewrite of the core of the database engine. But I'm curious if there are any other aspects of what you're building at Influx that you have planned for that near to medium term that you wanna discuss.

I think I already mentioned them earlier, which is basically additional language support,

you know, in terms of the scripting languages and the query languages.

And the stuff in IOx is basically gonna lead to this whole new set of features that we'll be launching as part of our cloud product and then later

as part of our enterprise offering. So

Well, for anybody who wants to follow along with you and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I still think the integration piece is hard. I think there are a million different projects out there to do these things

and tying them together in a way that's cohesive and basically

manageable,

observable, manageable, like iterable, so you can, like, create new bits of code and deploy them and test them or whatever. Like, all of that different stuff, like, there doesn't seem to be a good way to tie all those things together.

Right? I think it's kind of a result of, you know, you have a bunch of different open source projects or cloud provider projects, which all have individual, like, best of breed solutions, but then when it comes to tying them together, it's on every individual developer to be a systems integrator, essentially. And I think it would be nice to see some projects that really focus on that kind of developer experience in terms of integrating some of these pieces together and making

perennial problem. It seems like we start to inch closer

together become obsolete.

And then it just starts all over again. Right.

Right. Exactly.

Well, thank you very much for taking the time today to join me and share the work that you've been doing at InfluxData. It's definitely a very interesting set of projects and things that I've been keeping an eye on for a while. So I appreciate all the time and effort you've been putting into the time series ecosystem and the overall monitoring space. So thank you again for all of that, and I hope you enjoy the rest of your day.

Thank you.

You too.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
