Building A Reliable And Performant Router For Observability Data

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform.

If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai.

And for your machine learning workloads, they just announced dedicated CPU instances.

Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.

And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI Conference, the Strata Data Conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey. And today, I'm interviewing Ben Johnson and Luke Steensen about Vector, a high performance open source observability data router. So, Ben, can you start by introducing yourself?

Sure. My name is Ben. I am the cofounder and CTO of Timber.io.

And, Luke, how about yourself?

Yeah. I'm Luke Steensen. I'm an engineer at Timber.

And, Ben, going back to you, do you remember how you first got involved in the area of data management?

Yeah. So, I mean, just being an engineer, obviously, you get involved in it through observing your systems. And so before we started Timber, I was an engineer at SeatGeek. We dealt with all kinds of observability challenges there.

And, Luke, do you remember how you first got involved in the area of data management?

Yeah. So at my last job, I ended up working with Kafka quite a bit in a few different contexts. I ended up getting pretty involved with that project, leading some of our internal stream processing projects that we were trying to get off the ground. It's a very interesting space, and the more you dig into a lot of different engineering problems, the more it ends up boiling down to managing your data. Especially when you have a lot of it, it kinda becomes the primary challenge and limits a lot of what you're able to do. So the more tools and techniques you have to address those issues and use as design tools, the further you can get, I think.

And so you both work at Timber.io, and you have begun work on this Vector project. So I'm wondering if you can explain a bit about what it is and the overall reason that you had for creating it in the first place.

Yeah. Sure. So in the most basic sense, Vector is an observability data router, and it collects data from anywhere in your infrastructure, whether that be a log file or a TCP socket. It could be StatsD metrics.

And then Vector is designed to ingest that data and then send it to multiple storages. And so the idea is that it is vendor agnostic and collects data from many sources and sends it to many sinks. And the reason we created it was really for a number of reasons, I would say. One, you know, is being an observability company. When we initially launched Timber, it was a hosted logging platform, and we needed a way to collect our customers' data. We tried writing our own collector, initially in Go, that was very specific to our platform. That was very difficult. We started using off the shelf solutions and found those also to be difficult. We were getting a lot of support requests, and it was hard for us to contribute to them and debug them. And then I think in general, our ethos as a company is we wanna create a world where developers have choice and aren't locked in to specific technologies, and are able to move with the times and choose best in class tools for the job. And that's kinda what prompted us to start Vector. That vision, I think, is enabled by an open collector that is vendor agnostic and meets a quality standard that makes people wanna use it. It looks like we have other areas in this podcast where we'll get into some of the details there. But we really wanted to raise the bar on the open collectors and start to give control and ownership back to the people, the developers that were deploying Vector.

And as you mentioned, there are a number of other off the shelf solutions that are available. Personally, I've had a lot of experience with Fluentd, and I know that there are other systems coming from the Elastic Stack and other areas. I'm curious, what are some of the tools that you consider to be comparable or operating in the same space, and any of the ones that you've had experience with that you found to be lacking? And what were the particular areas that you felt needed to be addressed that weren't being handled by those other solutions?

Yeah. I think that's a really good question. So first, I would probably classify the collectors as either open or not.

And so, typically, we're not too concerned with vendor specific collectors, like the Splunk forwarder or any other sort of thing that just ships data to one service. So in the category of just comparing tools, I'll focus on the ones that are open. Like you said, Fluentd, Filebeat, Logstash. I think it's questionable that they're completely open, but I think we're more comparable to those tools. And I don't wanna say anything negative about the projects, because a lot of them were pieces of inspiration for us. We respect the fact that they are open and they were solving a problem at the time. But I'd say one of the projects that we thought was a great alternative and inspired us is one called Cernan. It was built by Postmates. It's also written in Rust, and that kinda opened our eyes a little bit to a new bar, a new standard that you could set with these collectors. We actually know Brian Troutwine. He was one of the developers that worked on it. He's been really friendly and helpful to us. But the reason we didn't use Cernan is that it was created out of necessity at Postmates, and it doesn't seem to be actively maintained. And so that's one of the big reasons we started Vector. I would say that's something that's lacking: a project that is actively maintained and is in it for the long term. Obviously, that's important.

And then in terms of actual specifics of these projects, there's a lot that I could say here. On one hand, we've seen a trend of certain tools that are built for a very specific storage and then sort of backed into supporting more sinks. And it seems like the incentives and fundamental practices of those tools are not aligned with many disparate storages that ingest data differently. For example, take the fundamentals of batching and stream processing. Those two ways of collecting data and sending it downstream don't work for every single storage that you wanna support. The other things are just the obvious ones, like performance, reliability, and having no dependencies. If you're not a strong Java shop, you probably aren't comfortable deploying something like Logstash and then managing the JVM and everything associated with that. And another thing is we wanted a collector that was fully vendor agnostic and vendor neutral, and a lot of them don't necessarily fall into that bucket. As I said before, that's something we really strongly believe in: an observability world where developers can rely on a best in class tool that is not biased and has aligned incentives with the people using it, because there are just so many benefits that stem from that.

And on the point of sustainability and openness, I'm curious, since you are part of a company and this is in some ways related to the product offering that you have, how you're approaching issues such as product governance and sustainability, and ensuring that the overall direction of the project remains impartial and open, and trying to foster a community around it so that it's not entirely reliant on the direction that you try to take it internally and that you're incorporating input from other people who are trying to use it for their specific use cases?

Yeah. I think that's a great question. So one is we wanna be totally transparent on everything. Everything we do with Vector, discussions, specifications, road map planning, it's all available on GitHub. Nothing is private there, and we want Vector to truly be an open project that anyone can contribute to. And then in terms of governance and sustainability, we try to do a really good job just maintaining the project. So number one is good issue management. Making sure that that's done properly helps people search for issues and helps them understand which issues need help and what are good first issues to start contributing on. We wrote an entire contribution guide and actually spent good time and put some serious thought into that so that people understand what the principles of Vector are and how to get started. And then I think the other thing that really sets Vector apart is the documentation, and I think that's actually very important for sustainability. It's really kind of a reflection of your project's respect for the users in a way, but it also serves as a really good opportunity to explain the project and help people understand the internals of the project and how to contribute to it. So it really kind of all comes together, but I'd say the number one thing is just transparency and making sure everything we do is out in the open.

And then in terms of the use cases that Vector enables, obviously, one of them is just being able to process logs from a source to a destination. But in the broader sense, what are some of the ways that Vector is being used both at Timber and with other users and organizations that have started experimenting with it beyond just the simple case?

So first, Vector's new, so we're still learning a lot as we go. But the core business use cases we see start with reducing costs. Vector is quite a bit more efficient than most collectors out there. So just by deploying Vector, you're gonna be using fewer CPU cycles and less memory, and you'll have more of that available for the app that's running on that server. Outside of that, the fact that Vector enables choosing multiple storages, and the storage that is best for your use case, lets you reduce cost as well.

So, for example, if you're running an ELK Stack, you don't necessarily wanna use your ELK Stack for archiving. You can use another cheaper, durable storage for that purpose and take that responsibility out of your ELK Stack, and that reduces cost. So I think that's an interesting way to use Vector. Another one is, like I said before, reducing lock-in. That use case is so powerful because it gives you agility, choice, control, and ownership of your data. Transitioning vendors is a big use case we've seen. So many companies we talk to are bogged down and locked in to vendors, and they're tired of paying the bill, but they don't see a clean way out. And observability is an interesting problem because it's not just technology coupling. There are human workflows that are coupled with the systems you're using. And so transitioning out of something that maybe isn't working for your organization anymore requires a bridge, and Vector is a really great way to do that. You deploy Vector, continue sending data to whatever vendor you're using, and then you can, at the same time, start to try out other storages and other setups without disrupting the human workflows in your organization. And I could keep going. There's data governance. We've seen people cleaning up their data and enforcing schemas. There's security and compliance: you have the ability to scrub sensitive data at the source before it even goes downstream.
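As an illustration of scrubbing sensitive data at the source, here is a minimal Python sketch. It is not Vector's implementation or configuration: the function names, the email pattern, and the list of dropped field names are all assumptions made for this example.

```python
import re

# Hypothetical pattern for this sketch; a real deployment would tune it to its own data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub(message: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL.sub("[REDACTED_EMAIL]", message)


def scrub_event(event: dict, drop_fields=("password", "token")) -> dict:
    """Return a copy of a flat log event with sensitive fields removed
    and string values scrubbed. The field names are illustrative only."""
    cleaned = {}
    for key, value in event.items():
        if key in drop_fields:
            continue  # never forward these fields downstream
        cleaned[key] = scrub(value) if isinstance(value, str) else value
    return cleaned


raw = {"message": "login failed for bob@example.com", "password": "hunter2", "status": 401}
safe = scrub_event(raw)
```

Because the scrubbing runs before any forwarding step, the sensitive values never leave the host in this sketch, regardless of how many downstream storages receive the event.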

And so, again, having a good open tool like this is so incredibly powerful because of all of those use cases that it enables, and it lets you take advantage of those when you're ready.

In terms of the actual implementation of the project, you've already mentioned in passing that it was written in Rust. And I'm wondering if you can dig into the overall system architecture and implementation of the project and some of the ways that it has evolved since you first began working on it.

Like you said, Rust is, I mean, that's kind of the first thing everybody looks at, that it's written in Rust. And on top of that, we're building with the Tokio asynchronous I/O stack of libraries and tools within the Rust ecosystem.

From the beginning, we've started Vector pretty simple architecturally. We have an eye on where we'd like to be, but we're trying to get there very, very incrementally. So at a high level, each of the internal components of Vector is generally either a source, a transform, or a sink. Those are probably familiar terms if you've dealt with this type of tool before. Sources are something that helps you ingest data. Transforms are different things you can do, like parsing JSON data into our internal data format, doing regular expression value extraction, enforcing schemas like Ben mentioned, all kinds of stuff like that. And then sinks, obviously, are where we actually forward that data downstream to some external storage system or service or something like that. So those are the high level pieces. We have some different patterns around each of those, and, obviously, there are different flavors. If you have a UDP syslog source, that's gonna look and operate a lot differently than a file tailing source. So there are a lot of different styles of implementation, but we fit them all into those three buckets of source, transform, and sink.

And then the way that you configure Vector, you're basically building a data flow graph, where data comes in through a source, flows through any number of transforms, and then goes down the graph into a sink or multiple sinks. We try to keep it as flexible as possible, so you can pretty much build an arbitrary graph of data flow. Obviously, there are gonna be situations where you could build something that's pathological or won't perform well, but we leaned towards giving users the flexibility to do what they want. So if you want to parse something as JSON, then use a regex to extract something out of one of those fields, and then enforce a schema and drop some fields, you can chain all these things together. You can have them fan out into different transforms and merge back together into a single sink, or feed two sinks from the same transform output, all that kind of stuff. So, basically, we try to keep it very flexible. We definitely don't advertise ourselves as a general purpose stream processor, but there's a lot of influence from working with those kinds of systems that has found its way into the design of Vector.
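The chain described above (parse JSON, extract a value with a regex, enforce a schema, then feed two sinks from the same output) can be sketched in a few lines of Python. This is only an illustration of the source, transform, and sink idea under assumed names; the functions, field names, and the convention that a transform returns `None` to drop an event are all hypothetical and are not Vector's API.

```python
import json
import re

STATUS = re.compile(r"status=(?P<status>\d{3})")


def parse_json(event):
    """Transform: parse the raw message as JSON and merge its fields into the event."""
    try:
        fields = json.loads(event["message"])
    except (KeyError, ValueError):
        return None  # drop events whose message is not valid JSON
    if not isinstance(fields, dict):
        return None
    return {**event, **fields}


def extract_status(event):
    """Transform: pull a status code out of the 'msg' field with a regex."""
    match = STATUS.search(str(event.get("msg", "")))
    if match:
        event = {**event, "status": int(match.group("status"))}
    return event


def enforce_schema(event, allowed=("msg", "status", "host")):
    """Transform: keep only the allowed fields and drop everything else."""
    return {key: value for key, value in event.items() if key in allowed}


def run(events, transforms, sinks):
    """Push each event through the transforms in order, then fan out to every sink."""
    for event in events:
        for transform in transforms:
            event = transform(event)
            if event is None:
                break  # a transform dropped the event
        else:
            for sink in sinks:
                sink(event)


# Two in-memory lists stand in for two downstream storages.
archive, search = [], []
lines = ['{"msg": "GET /health status=200", "host": "web-1", "debug": true}', "not json"]
run(({"message": line} for line in lines),
    [parse_json, extract_status, enforce_schema],
    [archive.append, search.append])
```

Both sinks receive the same cleaned event, and the line that is not valid JSON never reaches either of them.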

Yeah. The ability to map together different components of the overall flow is definitely useful. I've been using Fluentd for a while, which has some measure of that capability, but it's also somewhat constrained in that the logic of the actual pipeline flow is dependent on the order of specification in the configuration document, which sometimes makes it a bit difficult to understand exactly how to structure the document to make sure that everything is functioning properly. And there are some mechanisms for being able to route things slightly out of band with particular syntax, but just managing it has gotten to be somewhat complex. So when I was looking through the documentation for Vector, I appreciated the option of being able to simply say that the input to one of the steps is linked to the ID of one of the previous steps, so that you're not necessarily constrained by order of definition and you can instead just use the ID references to ensure that the flows are being...

Yeah. That was definitely something that we spent a lot of time thinking about, and we still spend a lot of time thinking about it. Because if you kinda squint at these config files, they're kind of like a program that you're writing. You have data inputs and processing steps and data outputs. So you want to make sure that that flow is clear to people, and you also wanna make sure that there aren't gonna be any surprises. I know a lot of tools, like you mentioned, have this as an implicit part of the way the config is written, which can be difficult to manage. We wanted to make it as explicit as possible, but also in a way that is still relatively readable when you just open up the config file. We've gone with a pretty simple TOML format, and then, like you mentioned, you just specify which input each component should draw from.
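To show why explicit input references make the order of definition irrelevant, here is a small Python sketch that resolves a pipeline from ID references. The component IDs, the `kind` and `inputs` keys, and the dictionary layout are assumptions for illustration; they mimic the idea of each component naming its inputs and are not a copy of Vector's TOML schema.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical components, deliberately declared "backwards".
config = {
    "out_archive": {"kind": "sink", "inputs": ["parse"]},
    "out_search": {"kind": "sink", "inputs": ["only_errors"]},
    "only_errors": {"kind": "transform", "inputs": ["parse"]},
    "parse": {"kind": "transform", "inputs": ["app_logs"]},
    "app_logs": {"kind": "source"},
}


def resolve_order(config):
    """Return component IDs so that every component comes after its inputs."""
    graph = {}
    for component_id, spec in config.items():
        inputs = spec.get("inputs", [])
        for upstream in inputs:
            if upstream not in config:
                raise ValueError(
                    f"{component_id!r} references unknown input {upstream!r}"
                )
        graph[component_id] = set(inputs)  # node -> the nodes it depends on
    # static_order() raises graphlib.CycleError if the references form a loop.
    return list(TopologicalSorter(graph).static_order())


order = resolve_order(config)
```

The resolved order always places `app_logs` before `parse`, and `parse` before both sinks, no matter how the entries are arranged in the file. Unknown references and cycles surface as errors at load time rather than as surprises at runtime.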

We have had some ideas and discussions about what our own configuration file format would look like. What we would love to do eventually is make these pipelines as pleasant to write as something like a bash pipeline, which is another really powerful inspiration for us. Obviously, they have their limitations, but consider the things that you can do just in a bash pipeline. You have a log file, you grep things out, you run it through other tools. There's all kinds of really cool stuff that you can do in a lightweight way, and that's something that we've put a little thought into. How can we be as close as possible to that level of power and flexibility while avoiding a lot of the limitations of, obviously, being a single tool on a single machine? I don't wanna get into all the gotchas that come along with writing bash one-liners. Obviously, there are a lot, but it's something that we wanna take as much of the good parts from as possible.

And then in terms of your decision process for the actual runtime implementation, for both the engine itself as well as the scripting layer that you implemented in Lua, what was the decision process that went into that as far as choosing and settling on Rust? And what were the overall considerations and requirements that you had as part of that decision process?

So from a high level, the things that we thought were most important when writing this tool, which is obviously gonna run on other people's machines, and hopefully on a lot of other people's machines, were about being respectful of the fact that they're willing to put our tool on a lot of their boxes. So we don't wanna use a lot of memory. We don't wanna use a lot of CPU. We wanna be as resource constrained as possible. So efficiency was a big point for us, which Rust obviously gives you the ability to achieve. I'm a big fan of Rust, so I could probably talk for a long time about all the wonderful features and things. But, honestly, the fact that it's a tool that lets you write very efficient programs and control your memory use pretty tightly is somewhere that we, I think, have a pretty big advantage over a lot of other tools. And then I was the first engineer on the project, and I know Rust quite well. So just from the human aspect of it, it made sense for us. We're lucky to have a couple of people at Timber who are very good with Rust, very familiar with it and involved in the community.

So I'd say it's worked out very well. From the embedded scripting perspective, Lua was kind of an obvious first choice for us. There's very good precedent for using Lua in this manner. For example, NGINX and HAProxy both have Lua environments that let you do a lot of amazing things that you would maybe never expect to be able to do with those tools. You can write a little bit of Lua, and there you go. You're all set. So Lua is very much built for this purpose. It's built as an embedded language, and there was a mature implementation of bindings for us. So it didn't take a lot of work to integrate Lua, and we have a lot of confidence that it's a reasonably performant, reliable thing that we can drop in and expect to work. That being said, it's definitely not the end all be all. While people can be familiar with Lua from a lot of different areas where it's used, like game development and, like I mentioned, some observability or infrastructure tools, we are interested in supporting more than just Lua. We actually have a work in progress JavaScript transform that will allow people to use that as an alternative engine for transformations.

And then a little bit longer term, and we kinda want this area to mature a little bit before we dive in, the WASM space has been super interesting. I think that, particularly from a flexibility and performance perspective, it could give us a platform to do some really interesting things in the future.

Yeah. I definitely think that the WebAssembly area is an interesting space to keep an eye on, because it is in some ways being positioned as a sort of universal runtime that multiple different languages can target. And then in terms of your choice of Rust, another benefit that it has when you're discussing memory efficiency is the guaranteed memory safety, which is certainly important when you're running it in customer environments because, that way, you're less likely to have memory leaks or accidentally crash their servers because of a bug in your implementation. So I definitely think that that's a good choice as well. And then one other aspect of the choice of Rust for the implementation language that I'm curious about is how that has impacted the overall uptake of users who are looking to contribute to the project, either because they're interested in learning Rust, or in terms of people who aren't necessarily familiar with Rust and any barriers that that may pose?

It's hard to know because, obviously, we can't inspect the alternate timeline where we wrote it in Go or something like that. I would say that there are ups and downs from a lot of different perspectives. From a developer interest perspective, I think Rust is something that a lot of people find interesting now. The sales pitch is a good one, and a lot of people find it compelling. So I think it's definitely caught a few more people's interest because it happens to be written in Rust. We try not to push on that too hard because, of course, there's the other set of people who do not like Rust and are very tired of hearing about it. So we love it, and we're very happy with it, but we try not to make it a primary marketing point or anything like that. But I think it does help to some degree.

And then from a contributions perspective, again, it's hard to say for sure, but I do know from experience that we have had a handful of people pop up from the open source community and give us some really high quality contributions, and we've been really happy with that. Like I said, we can't really know how that would compare to if we had written it in a language that more people are proficient in. But the contributions from the community that we have seen so far have been really high quality. The JavaScript transform that I mentioned is actually a good example of that. We had a contributor come in and do a ton of really great work to make that a reality, and it's something that we're pretty close to being able to merge and ship. So I definitely shared a little bit of that concern. I know Rust at least has the reputation of being a more difficult language to learn, but the community is there. There are a lot of really skilled developers that are interested in Rust and would love to have an open source project like Vector that they can contribute to.

And we've definitely seen a lot of benefit from that.

In terms of the internals of Vector, I'm curious how the data itself is represented once it is ingested by a source, and how you process it through the transforms, as far as whether there's a particular data format that you use internally in memory, and also any capability for schema enforcement as it's flowing through Vector out to the sinks?

Yeah. So right now, we have our own internal, in-memory data format. I don't wanna say it's thrown together, but it's something that's been incrementally evolving pretty rapidly as we build up the number of different sources and sinks that we support. This was actually something that we deliberately set out to do when we were building Vector. We didn't wanna start with the data model. There are some projects that do that, and I think there's definitely a space for that. The data modeling in the observability space is not always the best. But we explicitly wanted to leave that to other people, and we were going to start with the simplest possible thing and then add features as we found that we needed them in order to better support the data models of the downstream sinks and the transforms that we wanted to be able to do.

So from day one, the data model was basically just a string. You send us a log message, and we represent it as a string of characters. Obviously, it's grown a lot since then. We call them events internally. It's kind of our vague name for everything that flows through the system. Events can be a log or they can be a metric. If they're a metric, we support a number of different types, including counters, gauges, all your standard types of metrics from the StatsD family of tools. And then logs can be just a message. Like I said, just a string. We still support that as much as we ever have, but we also support more structured data. So right now, it's a flat map of string to something. We have a variety of different types that the values can be, and that's also something that's growing as we wanna better support different tools. So right now, it's a non-nested, JSON-ish representation in memory. We don't actually serialize it to JSON. We support a few extra types, like timestamps and things like that, that are important for our use case. But in general, that's how you can think about it.

We have a protocol buffers schema for that data format that we use when we serialize to disk, when we do some of our on-disk buffering. But I wouldn't say that's necessarily the primary representation. When you work with it in a transform, you're looking at that in-memory representation that, like I said, looks a lot like JSON. And that's something that we're constantly reevaluating and thinking about how we want to evolve.
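The event model described here, where an event is either a log (a flat map from string keys to typed values) or a metric such as a counter or gauge, can be sketched as follows. The class names, the choice of value types, and the `message` and `timestamp` keys are assumptions for illustration; Vector's actual types are defined in Rust and may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, Union

# A flat (non-nested) value: a few scalar types plus timestamps.
Value = Union[str, int, float, bool, datetime]


@dataclass
class LogEvent:
    """A log is a flat map of string keys to values."""
    fields: Dict[str, Value] = field(default_factory=dict)

    @classmethod
    def from_message(cls, message: str) -> "LogEvent":
        # The simplest case: a bare string wrapped with an ingest timestamp.
        return cls({"message": message, "timestamp": datetime.now(timezone.utc)})


class MetricKind(Enum):
    COUNTER = "counter"
    GAUGE = "gauge"


@dataclass
class MetricEvent:
    name: str
    kind: MetricKind
    value: float


# "Event" is the umbrella name for anything flowing through the pipeline.
Event = Union[LogEvent, MetricEvent]


def describe(event: Event) -> str:
    """Summarize an event, branching on which variant it is."""
    if isinstance(event, LogEvent):
        return f"log with fields {sorted(event.fields)}"
    return f"{event.kind.value} {event.name}={event.value}"


log_summary = describe(LogEvent.from_message("user signed in"))
metric_summary = describe(MetricEvent("requests", MetricKind.COUNTER, 1.0))
```

Keeping the log variant flat makes transforms simple to write, at the cost of not being able to represent nested structures directly.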

I think the next step in that evolution is to make it not necessarily just a flat map and move it towards supporting nested maps and nested keys. So it's gonna move more towards actual, full JSON with better types and support for things like that.

And on the reliability front, you mentioned briefly the idea of disk buffering, and that's definitely something that is necessary for the case where you need to restart the service and you don't want to lose messages that have been sent to an aggregator node, for instance. I'm curious, what are some of the overall capabilities in Vector that support this reliability objective? And also, in terms of things such as malformed messages, if you're trying to enforce a schema, is there any way of putting those into a dead letter queue for reprocessing or anything along those lines?

Yeah. A dead letter queue specifically isn't something that we support at the moment.

It is something that we've been thinking about, and we do want to come up with a good way to support it, but currently that isn't something that we have. A lot of transforms, like the schema enforcement transform, will end up just dropping the events that don't conform. Or rather, if it can't make an event meet the schema by dropping fields, for example, it will just drop the event, and we recognize the shortcomings there.

I think one of the things that is a little bit nice from an implementer's perspective about working in the observability space, as opposed to the normal data streaming world with application data, is that there's more of an expectation of best effort, which is something that we're willing to take advantage of a little bit in the very early stages of a certain feature or tool. But it's definitely a part of the ecosystem that we want to push forward. So that's something we try to keep in mind as we build all this stuff: yes, it might be okay for now, and we may have parity with other tools if we just drop messages that don't meet a certain schema, but how can we do better than that?

As for other things in the toolbox that you can reach for for this type of thing, the most basic one would be that you can send data to multiple sinks.
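To make the two ideas above concrete, here is a minimal Python sketch, not Vector's code or configuration. It assumes events are plain dicts and a schema is a mapping of field name to type. `Router`, `conforms`, and the `dead_letters` list are hypothetical names, and the dead letter path illustrates the feature described as not yet supported rather than anything Vector does today.

```python
from dataclasses import dataclass, field
from typing import Callable

Event = dict  # an event is modeled as a plain dict for this sketch


def conforms(event: Event, schema: dict[str, type]) -> bool:
    """Return True if every schema field is present with the expected type."""
    return all(isinstance(event.get(name), kind) for name, kind in schema.items())


@dataclass
class Router:
    schema: dict[str, type]
    sinks: list[Callable[[Event], None]]
    dead_letters: list[Event] = field(default_factory=list)

    def route(self, event: Event) -> None:
        if not conforms(event, self.schema):
            # Hypothetical dead letter path: keep the event for later
            # reprocessing instead of silently dropping it.
            self.dead_letters.append(event)
            return
        # Fan out: every sink receives its own copy of each valid event.
        for sink in self.sinks:
            sink(dict(event))


aggregator_a: list[Event] = []
aggregator_b: list[Event] = []
router = Router(
    schema={"host": str, "message": str},
    sinks=[aggregator_a.append, aggregator_b.append],
)
router.route({"host": "web-1", "message": "started"})
router.route({"host": "web-2", "message": 42})  # wrong type for "message"
```

After these two calls, both aggregators hold the valid event and the malformed one sits in `router.dead_letters`, where it could be inspected or replayed later.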

So if you have a classic syslog-like setup where you're forwarding logs around, it's super simple to just add a secondary sink so that you forward to both syslog aggregator A and syslog aggregator B. That's nothing particularly groundbreaking, but it's kind of the start. Beyond that, I mentioned the disk buffer, where we want to do as good a job as we can of ensuring that we don't lose your data once you have sent it to us.
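The disk buffer idea can be sketched roughly as follows. This is an illustration in Python under simplifying assumptions (one file, newline-delimited JSON, acknowledge everything at once), not how Vector's buffer is implemented. The `DiskBuffer` class and its method names are made up for the example.

```python
import json
import os
import tempfile


class DiskBuffer:
    """Append-only, newline-delimited JSON buffer backed by a single file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, event: dict) -> None:
        # Write and fsync before returning, so an event that has been
        # accepted survives a process restart.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def pending(self) -> list[dict]:
        # Everything still in the file has not been acknowledged downstream.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def acknowledge_all(self) -> None:
        # Truncate once the downstream sink has confirmed delivery.
        open(self.path, "w").close()


path = os.path.join(tempfile.mkdtemp(), "buffer.ndjson")
DiskBuffer(path).append({"message": "hello"})

# Simulate a restart: a fresh object sees the event written earlier.
restarted = DiskBuffer(path)
recovered = restarted.pending()
restarted.acknowledge_all()
```

The design choice worth noting is the ordering: the event is durable on disk before the sender is told it was accepted, and it is only removed after the downstream side confirms delivery.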

We are still a single-node tool at this point. We're not a distributed storage system, so there are going to be some inherent limitations in the guarantees that we can provide you there. If you really want to make sure that you're not losing any data at all, Vector is not going to be able to give you the guarantees that something like Kafka would. So we want to make sure that we work well with tools like Kafka that are going to give you pretty solid, redundant, reliable, distributed storage guarantees.

Let's see. Other than those two, writing the tool in Rust is kind of an indirect way that we try to make it as reliable as possible. I think Rust has a bit of a reputation for making it tricky to do things, in that the compiler is very picky and wants to make sure that everything you're doing is safe. That's something you can definitely take advantage of to guide you in writing code. You mentioned memory-safe code, but it expands beyond that into ensuring that every error that pops up is handled explicitly, at that level or a level above, and things like that. It kind of guides you into writing more reliable code by default. Obviously, it's still on you to make sure that you're covering all the cases, but it definitely helps.

And then moving forward, we're going to spend a lot of time in the very near future setting up internal torture environments, if you will, where we can run Vector for long periods of time and induce certain failures in the network and the upstream services, maybe delete some data from underneath it on disk, and that kind of thing. It's along the lines of, if you're familiar with them, the suites of database testing tools. Obviously, we don't have quite the same types of invariants that an actual database would have, but I think we can use a lot of those techniques to stress Vector and see how it responds. Like I said, we're going to be limited in what we can do by the fact that we're a single-node system, and if you're sending us data over UDP, there aren't a ton of guarantees that we're going to be able to give you. But to the degree that we're able to give guarantees, we very much would like to do that. So that is definitely a focus of ours, and we're going to be exploring that as much as possible.
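As a small-scale illustration of the failure-injection idea, here is a hedged Python sketch. It is not the test setup described above, which had not been built at the time of this conversation. `FlakySink` and `deliver` are hypothetical names, and the only fault simulated is a send that raises before the event is recorded.

```python
import random


class FlakySink:
    """Test double that fails a configurable fraction of sends."""

    def __init__(self, failure_rate: float, seed: int) -> None:
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so runs are reproducible
        self.received: list[int] = []

    def send(self, event: int) -> None:
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        self.received.append(event)


def deliver(events: list[int], sink: FlakySink, max_attempts: int = 50) -> int:
    """Send each event, retrying on injected failures; return total retries."""
    retries = 0
    for event in events:
        for _ in range(max_attempts):
            try:
                sink.send(event)
                break
            except ConnectionError:
                retries += 1
        else:
            raise RuntimeError(f"gave up on event {event}")
    return retries


events = list(range(1000))
sink = FlakySink(failure_rate=0.3, seed=7)
retries = deliver(events, sink)

# The invariant under test: despite injected failures, nothing is lost,
# duplicated, or reordered.
assert sink.received == events
```

A real harness would inject richer faults (partial writes, network partitions, deleted files), but the shape is the same: state an invariant, inject failures, and check that the invariant still holds.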

And then in terms of the deployment topologies that are available, you mentioned one situation where you're forwarding to a Kafka topic. But I'm curious what other options there are for ensuring high availability and just the overall uptime of the system for being able to deliver messages or events or data from the source to the various destinations?

Yeah. There are a few different general topology patterns that we've documented and that we recommend to people. The simplest one, depending on how your infrastructure is set up, can just be to run Vector on each of your application servers, or whatever it is that you have, and run them there in a very distributed manner, forwarding directly to a certain upstream logging service or something like that. You can do that without necessarily having any aggregation happening in your infrastructure. That's pretty easy to get started with, but it does have limitations, for example if you don't want to allow outbound Internet access from each of your nodes.

The next step, like you mentioned, is that we would support a second tier of Vector, running maybe on a dedicated box, and you would have a number of nodes forward to this more centralized aggregator node. Then that node would forward to whatever other sinks you have in mind. That's the second level of complexity, I would say. You do get some benefits, in that you have more power to do things like aggregations and sampling in a centralized manner. There are going to be certain things that you can't necessarily do if you never bring the data together.

And you can do that especially if you're looking to reduce cost. It's nice to have that aggregator node as a leverage point where you can bring everything together, evaluate what is most important for you to forward to different places, and do that there.
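One way an aggregator tier might decide what is "most important to forward" is to keep every error and sample the rest. The following Python sketch assumes events are dicts with `host`, `id`, and `level` fields. Those field names and the `should_forward` function are invented for illustration and do not describe Vector's own sampling behavior.

```python
import zlib


def should_forward(event: dict, sample_rate: int) -> bool:
    """Keep every error; keep roughly 1 in `sample_rate` of everything else."""
    if event.get("level") == "error":
        return True
    # Hash a stable key so the same event always gets the same decision,
    # regardless of which process evaluates it.
    key = f'{event["host"]}:{event["id"]}'.encode("utf-8")
    return zlib.crc32(key) % sample_rate == 0


events = [
    {"host": f"web-{i % 5}", "id": i, "level": "error" if i % 100 == 0 else "info"}
    for i in range(10_000)
]
forwarded = [e for e in events if should_forward(e, sample_rate=10)]
errors_in = sum(e["level"] == "error" for e in events)
errors_out = sum(e["level"] == "error" for e in forwarded)
```

Hash-based sampling is used here instead of a random draw so that the decision is reproducible, which matters if more than one aggregator might see the same event.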

And then for people who want more reliability than a single aggregation node, at this point our recommendation is something like Kafka, which is going to give you distributed, durable storage. That is a big jump in terms of infrastructure complexity, so there's definitely room for some in-betweens there that we're working on, in terms of having a failover option. Right now, you could put a couple of aggregator nodes behind a TCP load balancer or something like that. That's not necessarily going to be the best experience, so we're investigating our options there to try to give people a good range of choices for how much they're willing to invest in the infrastructure and what kind of reliability and robustness benefits they need.

Another aspect of the operational characteristics of the system is being able to have visibility into, particularly at the aggregator level, what the current status is of the buffering, any errors that are cropping up, and just the overall system capacity. I'm curious if there's any current capability for that, or what the future plans are along those lines.

Yeah. We have a setup for our own internal metrics at this point. That is another thing, alongside the reliability stuff that you mentioned, that we're looking at very closely right now for what we want to do next. The way we've set ourselves up, we have an event-based system internally where events can be emitted normally as log events, but we also have the means to essentially send them through something like a Vector pipeline, where we can do aggregations, filter and sample, and do that kind of stuff to get better insight into what's happening in the process. We haven't gotten as far as I'd like down that road at this point, but I think we have a pretty solid foundation to do some interesting things, and it's definitely going to be a point of focus in the next few weeks.

So in terms of the overall roadmap, you've got a fairly detailed set of features and capabilities that you're looking to implement. I'm wondering what your decision process was in terms of the priority ordering of those features, and how you identified what the necessary set was for a 1.0 release?

So initially, when we planned out the project, our roadmap was largely influenced by our past experiences, not only supporting Timber customers but running our own observability tools. Just based on the previous questions you asked, it was obvious to us that we would need to support those different types of deployment models, so part of the roadmap was dictated by that. You can see later on the roadmap that we want to support stream processors so we can enable that sort of deployment topology. It's very much evolving, though, as we learn and collect data from customers and their use cases, and we're actually going to make some changes to it.

But in terms of a 1.0 release, everything that you see in the roadmap on GitHub, which is represented as milestones, reflects what we think a 1.0 release is: something a reasonably sized company could deploy and rely on. And again, given our experience, a lot of that is dependent on Kafka or some sort of more complex topology as it relates to collecting your data and routing it downstream.

And then in terms of the current state of the system, how would you characterize the overall production readiness of it, and are there any features currently missing that you think would be necessary for a medium to large scale company to be able to adopt it readily?

Yeah. So in terms of a 1.0 release, where we would recommend it for very stringent production use cases, I think of what Luke just talked about: internal metrics. It's really important that we improve Vector's own internal observability and provide operators the tools necessary to monitor performance, set up alarms, and make sure that they have confidence in Vector. Internally, the stress testing is also something that would raise our confidence. We have a lot of interesting stress testing use cases that we want to run Vector through, and I think that'll expose some problems, but getting that done would definitely raise our confidence.

And then I think there's just some general house cleanup that would be helpful for a 1.0 release. The initial stages of this project have been inherently a little more messy, because we are building out the foundation and moving pretty quickly with our integrations. I would like to see that settle down more when we get to 1.0, so that we have smaller incremental releases, and we take breaking changes incredibly seriously. Vector's reliability and least-surprise philosophy definitely plays into how we're releasing the software and making sure that we aren't releasing a minor update that actually has breaking changes in it, for example. So I would say those are the main things missing before we can officially call it 1.0.

Outside of that, the one other thing that we want to do is provide more education on some high-level use cases around Vector. Right now, the documentation is very good in that it dives deep into all the different components, like sources, sinks, and transforms, and all the options available. But I think we're lacking in more guidance around how you would deploy Vector in an AWS environment or a GCP environment. That's definitely not needed for 1.0, but I think it is one of the big missing pieces that will make Vector more of a joy to use.

In terms of the integrations, what are some of the ways that people can add new capabilities to the system? Does it require being compiled into the static binary, or are there other integration points where somebody can add a plugin? And then also in terms of just use of the system, I'm curious what options there are as far as being able to test out a configuration to make sure that the end-to-end flow is what you're expecting.

So in terms of plugins, we don't have a strong concept of that right now. All of the transforms that I've mentioned, and the sources and sinks, are written in Rust and natively compiled into the system. That has a lot of benefits, obviously, in terms of performance, and we get to make sure that everything fits in perfectly ahead of time. But, obviously, it's not quite as extensible as we'd like at this point. There are a number of strategies that we've thought about for allowing more user-provided plugins. I know that's a big feature of Fluentd that people get a lot of use out of. So it is something that we'd like to support, but we want to be careful how we do it, because we don't want to give up our core strengths, which I'd say are the performance, robustness, and reliability of the system. We want to be careful how we expose those extension points to make sure that the system as a whole maintains those properties that we find most valuable.

Yeah. And to echo Luke on that, plugin ecosystems are incredibly valuable, but they can be very dangerous. They can ruin a project's reputation as it relates to reliability and performance. We've seen that firsthand with a lot of the different interesting Fluentd setups that we've seen with our customers. They'll use off-the-shelf plugins that aren't necessarily written properly or maintained actively, and it adds this variable to running Vector that makes it very hard to ensure that it's meeting the reliability and performance standards that we want it to meet. So I would say that if we do introduce a plugin system, it'll be quite a bit different than what I think people are expecting. That's something that we're putting a lot of thought into.

And to go back to some of the things you said before, we've had community contributions, and they're very high quality, but those still go through a code review process that exposes quite a few fundamental differences and issues in the code that would otherwise not have been caught. So it's an interesting conundrum to be in. On the one hand, we like that process because it ensures quality. On the other, it is a blocker in certain use cases.

Yeah. I think our strategy there so far has basically been to allow programmability in limited places, for example the Lua transform and the upcoming JavaScript transform. There is a surprising amount that you can do even when you're limited to the context of a single transformation. We are interested in extending that to say, would it make sense to have a sink or a source where you could write a lot of the logic in something like Lua or JavaScript or a language compiled to WebAssembly? Then we would provide almost a standard library of I/O functions and things like that, which we would have more control over and could do a little bit more with to ensure, like Ben said, the performance and reliability of the system.

And then the final thing is that we really want Vector to be as easy to contribute to as possible. Ben mentioned some housekeeping things that we want to do before we really consider it 1.0, and I think a lot of that is around extracting common patterns for things like sources, sinks, and transforms into our internal library, so that if you want to come in and contribute support to Vector for a new downstream database or metrics service or something like that, the process is as simple as possible, and you are guided onto the right path in terms of handling your errors and doing retries by default. We want all of that stuff to be right there and very easy. There's always going to be a barrier if you have to write a pull request to get support for something, as opposed to just writing a plugin, but we want to minimize that as much as we possibly can.
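The "retries by default" idea can be illustrated with a small sketch. Vector itself is written in Rust, and this Python version is only meant to show the pattern, not Vector's internal library. `Sink`, `send_batch`, `write`, and `ExampleDatabaseSink` are all hypothetical names, and the retry policy is deliberately simplistic (a fixed attempt count with no backoff).

```python
from abc import ABC, abstractmethod


class Sink(ABC):
    """Hypothetical shared base: subclasses get retry handling for free."""

    max_attempts = 3

    @abstractmethod
    def send_batch(self, batch: list[dict]) -> None:
        """Deliver one batch downstream; raise on failure."""

    def write(self, batch: list[dict]) -> bool:
        # Retry policy lives here, once, rather than in every integration.
        for _ in range(self.max_attempts):
            try:
                self.send_batch(batch)
                return True
            except OSError:
                continue
        return False  # caller decides whether to buffer, drop, or alert


class ExampleDatabaseSink(Sink):
    """A contributor writes only the delivery logic for the new service."""

    def __init__(self) -> None:
        self.rows: list[dict] = []
        self.calls = 0

    def send_batch(self, batch: list[dict]) -> None:
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("simulated transient outage")
        self.rows.extend(batch)


sink = ExampleDatabaseSink()
ok = sink.write([{"message": "hello"}])
```

The contributor's class contains no retry logic at all, yet the first simulated failure is retried and the batch is delivered on the second attempt.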

And there are a whole bunch of other aspects of this project that we haven't touched on yet that I've gone through in the documentation and that I think are interesting. But I'm wondering if there is anything in particular that either of you would like to discuss further before we start to close out the show.

In terms of the actual technical implementation of Vector, I think one of the unique things worth mentioning is Vector's intent to be the single data collector across all of your different types of data. We think a big gap in the industry right now is that you need different tools for metrics, logs, exceptions, and traces, and we think we can really simplify that. That's one of the things that we didn't touch on very well in this podcast. Right now we support logs and metrics, and we're considering expanding support for different types of observability data so that you can claim full ownership and control of the collection and routing of that data.

Yeah. I mean, there are small technical things within Vector that I think would be neat to talk about for a little while. But for me, the most interesting part of the project is viewing it through the lens of being a platform that you program, that's as flexible and programmable as possible, kind of in the vein of those Bash one-liners that I talked about. That's something that can be a lot of fun and can be very productive. And the challenge of lifting that thing that you do in the small, on your own computer or on a single server, up to a distributed context is, I find, a really interesting challenge, and there are a lot of fun little pieces that you get to put together as you try to move in that direction.

Well, I'm definitely going to be keeping an eye on this project. For anybody who wants to follow along with you or get in touch with either of you and keep track of the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

For me, I think there are so many interesting stream processing systems, databases, tools, and things like that, but there hasn't been quite as much attention paid to the glue. How do you get your data in? How do you integrate these things together? That ends up being a big barrier to getting people into these tools and getting a lot of value out of them. There's a really high activation energy, or it's assumed that you're already bought into a given ecosystem or something like that. That's the biggest thing for me: for a lot of people and a lot of companies, it takes a lot of engineering effort to get to the point where you can do interesting things with these tools.

And as an extension of that, it doesn't stop at the collection side. It goes all the way to the analysis side as well. Our ethos at Timber is to help empower users to accomplish that and take ownership of their data and their observability strategy. Vector is the first project that we're launching in that initiative, but we think it goes all the way across. To echo Luke, that is the biggest thing, because so many people get so frustrated with it that they just throw their hands up and hand their money over to a vendor, which is fine in a lot of use cases, but it's not empowering. And there's no silver bullet. There's no one storage system or one vendor that is going to do everything amazingly. So at the end of the day, being able to take advantage of all the different technologies and tools and combine them into a cohesive observability strategy, in a way that is flexible and lets you evolve with the times, is the holy grail, and that's what we want to enable. We think that process needs quite a bit of improvement.

Well, I appreciate the both of you taking your time today to join me and discuss the work that you're doing on Vector and at Timber. It's definitely a very interesting project and one that I hope to be able to make use of soon to facilitate some of my overall data collection efforts. So we appreciate all of your time and effort on that, and I hope you enjoy the rest of your day.

Thank you. Yeah. And just to add to that, if anyone listening wants to get involved or ask questions, there's a community link on the repo itself. You can chat with us. We want to be really transparent and open, and we always welcome conversations around things we're doing.

Yeah, definitely. I just want to emphasize everything Ben said, and thanks so much for having us.

Yeah. Thank you.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
