Summary
Cloud computing and ubiquitous virtualization have changed the way that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects the data that your servers generate, monitors for unexpected anomalies in behavior that would indicate a breach, and notifies you in near real time. In this episode ThreatStack’s director of operations, Pete Cheslock, and senior infrastructure security engineer, Patrick Cable, discuss the data infrastructure that supports their platform, how they capture and process the data from client systems, and how that information can be used to keep your systems safe from attackers.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new t-shirt.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Pat Cable about the data infrastructure and security controls at ThreatStack
Interview
- Introduction
- How did you get involved in the area of data management?
- Why don’t you start by explaining what ThreatStack does?
- What was lacking in the existing options (services and self-hosted/open source) that ThreatStack solves for?
- Can you describe the type(s) of data that you collect and how it is structured?
- What is the high level data infrastructure that you use for ingesting, storing, and analyzing your customer data?
- How do you ensure a consistent format of the information that you receive?
- How do you ensure that the various pieces of your platform are deployed using the proper configurations and operating as intended?
- How much configuration do you provide to the end user in terms of the captured data, such as sampling rate or additional context?
- I understand that your original architecture used RabbitMQ as your ingest mechanism, which you then migrated to Kafka. What was your initial motivation for that change?
- How much of a benefit has that been in terms of overall complexity and cost (both time and infrastructure)?
- How do you ensure the security and provenance of the data that you collect as it traverses your infrastructure?
- What are some of the most common vulnerabilities that you detect in your client’s infrastructure?
- For someone who wants to start using ThreatStack, what does the setup process look like?
- What have you found to be the most challenging aspects of building and managing the data processes in your environment?
- What are some of the projects that you have planned to improve the capacity or capabilities of your infrastructure?
Contact Info
- Pete Cheslock
- @petecheslock on Twitter
- Website
- petecheslock on GitHub
- Patrick Cable
- ThreatStack
- Website
- @threatstack on Twitter
- threatstack on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- ThreatStack
- SecDevOps
- Sonian
- EC2
- Snort
- Snorby
- Suricata
- Tripwire
- Syscall (System Call)
- AuditD
- CloudTrail
- Naxsi
- Cloud Native
- File Integrity Monitoring (FIM)
- Amazon Web Services (AWS)
- RabbitMQ
- ZeroMQ
- Kafka
- Spark
- Slack
- PagerDuty
- JSON
- Microservices
- Cassandra
- ElasticSearch
- Sensu
- Service Discovery
- Honeypot
- Kubernetes
- PostgreSQL
- Druid
- Flink
- Launch Darkly
- Chef
- Consul
- Terraform
- CloudFormation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And for complete visibility into the health of your pipeline, including deployment tracking and powerful alerting driven by machine learning, Datadog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you'll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new t-shirt. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey, and today I'm interviewing Pete Cheslock and Pat Cable about the data infrastructure and security controls at ThreatStack.
So, Pete, could you start by introducing yourself?
[00:01:17] Unknown:
Yeah. Hello. I am Pete Cheslock, and I currently run the technical operations for ThreatStack.
[00:01:24] Unknown:
And, Pat, how about yourself? Yeah. I'm, Patrick Cable. I do, a little bit of everything. So I do security operations, and all the secdevopsy
[00:01:35] Unknown:
things here at ThreatStack. Thought it was devopssec. Who knows? We're still figuring that one out. I think everyone's still figuring that one out. Valid point.
[00:01:45] Unknown:
And, Pete, going back to you, how did you first get involved in the area of data management?
[00:01:50] Unknown:
I guess by accident, really. You know, the sheer fact of joining ThreatStack almost 4 years ago, and kind of the type of data and everything that we store: we keep a lot of data, we process a lot of data. It's definitely the most that I've dealt with, I think, at this point, at least by sheer event count. But a previous company I was at, a company called Sonian, was an email archiving company. And if you think about email archiving, the number of events, like emails, would be a lot less than security events. But the amount of data emails take up, if you think about PDF attachments and other things, was petabytes. So we were storing a significantly larger amount of data then, just not as many individual items as we do today. Different types of scale, I guess.
[00:02:44] Unknown:
And, Pat, how about you? How did you get involved in the area of data management?
[00:02:50] Unknown:
Strictly along for the ride here at ThreatStack. Right? So, one of the things that I care about is security time series data. Right? So, seeing when stuff happens on servers, and that actually kind of dovetails nicely into what ThreatStack does. Right? So we capture system calls and other data about your EC2, and actually not just EC2, but any infrastructure, you know, running our agent. So, you know, I'm kinda along for the ride when it comes to data management, because I have to make sure that we're securing all of it properly.
[00:03:29] Unknown:
And so as you mentioned, ThreatStack is a platform and service for being able to capture event data within your infrastructure and network environment, for being able to analyze that for intrusion detection. And I'm wondering if you can provide some compare and contrast between the system and offerings that ThreatStack provides versus what the existing landscape was at the time when it was first started, in terms of either open source or hosted options
[00:04:02] Unknown:
for doing the same kind of analysis of determining what threats might be present within an infrastructure or network environment? Yeah. So one of the big things that I think differentiates us is a lot of the existing solutions would focus on logs. Right? You know, somebody decided to log out some information, and, you know, there are existing solutions that will collect and aggregate all that log data. And all that log data is really good stuff. Right? It's really important; you wanna be capturing that log data. But the thing that always kinda fell short for me with log data was, you know, what about stuff that I'm not logging, but I wanna know what's happening on the system? What is somebody logging into the server and actually running and executing? Right?
So the thing that I found very exciting about ThreatStack was that we could do something like have an agent collect all the syscalls, which is, you know, for folks that might not know what syscalls are, the base kernel actions for doing anything in Linux. Right? So, you know, let's say that I run an application, I run a Ruby script. Right? I'm gonna be calling exec to Ruby, which is an action that I might care about. I'm gonna be running open on the actual Ruby script itself, and so on and so forth. So our advantage, I think, is that we collect all of that system call data and all parameters that go with it, and then can start constructing a timeline of what people are doing on these servers.
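That fork/exec/open sequence can be seen in miniature with any process spawn. A small sketch of our own (not part of ThreatStack's agent, which reads these events from the kernel audit interface) using Python's `subprocess`, which wraps exactly that fork plus execve pair:

```python
import subprocess

# Launching any program boils down to a handful of kernel syscalls:
#   fork()/clone() - create the child process
#   execve()       - replace the child's image with the new program
#   openat()       - the interpreter opens the script file itself
# subprocess.run() is the user-space wrapper around that fork/exec
# sequence, so running it generates exactly the kind of events an
# audit-based agent would record.
result = subprocess.run(
    ["echo", "hello from the child process"],  # stands in for `ruby script.rb`
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

Tracing this with `strace -f` would show the clone, execve, and openat calls spelled out, which is the raw material the agent works from.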
[00:05:42] Unknown:
Yeah. From the open source component: I started here in 2014, and I was part of kind of a prelaunch team, so really very early in the ThreatStack timeline. The closest competitor was kind of the open source concepts around auditd. So in the very beginning, the ThreatStack product was kind of an auditd replacement, cloud hosted. So we would have our own agent, which basically worked like auditd, but much better. And we would collect, like Pat said, all of these system calls and ingest them into our platform, doing kind of analysis and determining what's going on. Trying to answer that question of, like, who did what when.
Of course, as we've grown over the last 4 years, we've added in a lot more features, which kind of kind of blurs the line of, like, what other alternatives are out there because we can do things like vulnerability assessment. We can integrate with, Amazon, specifically around CloudTrail. We can suck in those events and and and make sense of them. And, you know, doing other things like threat intelligence, detection, and, and and really trying to we use the word the word inform a lot. You know, trying to, like, inform our customers into anomalies in their environment. Yeah. Yeah. So the 1 that I think of
[00:06:58] Unknown:
is, you know, you have a web server running and, maybe it hasn't been patched in a while, and all of a sudden you see your web server process fork bash. Like, that's probably bad.
[00:07:11] Unknown:
Yeah. I'd say that's not good. Maybe it makes a connection to, like, an IP in China too. Like, that would be pretty bad. Take your
[00:07:18] Unknown:
pick. So, being able to kind of alert on those too. Right? So there's, like, events, and we capture a lot of events, And then customers will write rules to bring those events to, step 1, step 2, step 3.
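The rule idea described here, customer-written conditions that promote raw events into alerts, can be sketched as a toy matcher. Field names, rule shape, and severity levels are all our invention, not ThreatStack's actual schema:

```python
# Toy illustration of event-to-alert rule matching. An event trips a rule
# when every field in the rule's "when" clause matches the event.
def matches(rule, event):
    return all(event.get(k) == v for k, v in rule["when"].items())

# Hypothetical rule: an unpatched web server forking a shell is suspicious.
rules = [
    {"when": {"syscall": "exec", "exe": "/bin/bash",
              "parent_exe": "/usr/sbin/nginx"},
     "alert": "web server spawned a shell", "severity": 1},
]

events = [
    {"syscall": "exec", "exe": "/usr/bin/ruby", "parent_exe": "/usr/sbin/cron"},
    {"syscall": "exec", "exe": "/bin/bash", "parent_exe": "/usr/sbin/nginx"},
]

alerts = [(r["severity"], r["alert"])
          for e in events for r in rules if matches(r, e)]
print(alerts)  # only the nginx -> bash exec trips the rule
```

A real rules engine would also consider network destinations (the "IP in China" case above), but the match-then-escalate shape is the same.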
[00:07:32] Unknown:
And the focus seems to be largely around the hosts and the processes that are executing on those hosts. And I'm wondering whether ThreatStack has any overlap with tools such as Snort or Suricata, which are more for detecting inbound network connections, and whether those tools really even make much sense in more modern cloud environments, or if they are more of an artifact of data center oriented infrastructure.
[00:08:09] Unknown:
What's pretty funny is that the founders of ThreatStack actually created an open source project called Snorby, which was Snort, but, like, on Ruby on Rails. So it was like a Ruby on Rails tool for managing Snort kind of data. And if you go back to the very early, kind of before I started, ThreatStack ideas that the founders had, the initial idea was kind of a place that people could pull in Suricata and Snort and other kind of intelligence data, I guess, in order to understand, you know, what was going on in their environment. Only later did they realize that the value really was in being able to kind of have an agent that can capture this data off the host directly and try to tie that in. That's when they really started going down that path. But, you know, to be honest, I can't really speak for, you know, some of those tools. I haven't used them in years at this point, and how they've kind of aged for the more, I don't know, we call it cloud native world, right, where hosts are very ephemeral. They come and they go.
You don't have the long-lived instances, for the most part. You know, one thing that strikes me
[00:09:13] Unknown:
is when you look at some of these network based tools, it's interesting to see, you know, host has connected to thing, but without the surrounding context of what's running on the machine, you know, the next place an analyst is gonna go is, okay, so let me figure out what on the machine made that connection. Yeah. What was the process that was on the machine? What user ran that process? Yeah. So it's interesting to me that these tools certainly serve a purpose, and I think they serve a bigger purpose when you have longer lived instances.
And now when we're talking about cloud, instances may come and go, you might not necessarily know what was run on them, you know, it's a little bit of a harder sell, I think, for those other tools.
[00:10:05] Unknown:
And in terms of the types and structure of the data that you're collecting with your agents, I'm wondering if you can describe what the typical events would look like, and what the flow is from when the event is captured on the host by the agent all the way through to storing and analyzing and potentially alerting on it.
[00:10:30] Unknown:
Oh, so many pipelines. Yeah. There's a lot of stuff to that one. I'll give it a shot here. So the types of data, like Pat was saying, are essentially Linux system calls, network connections, process connections. There's a few more things. We do something called file integrity monitoring, which is where we can put watches on kind of key files in your infrastructure. So that, maybe like an SSL key or a configuration file that has, like, a database password in it, we can put a watch on those files. And then when a process accesses them, we'll be able to see which process accessed which file, if you had a watch on it.
So, you know, agent events, that's kind of one type of event we can capture, and then CloudTrail events would be other ones. The nice thing about CloudTrail events is that they're pretty standard. You know, people turn on CloudTrail for AWS, which is like API auditing, for lack of a better term, of, like, your API usage on AWS. But on the agent side at least, you know, the data is what we consider to be much more real time. For CloudTrail events that we consume, there's a delay that it takes for Amazon even to put the CloudTrail events in your S3 bucket. Once they're in the bucket, then they're in our infrastructure very quickly. But it's actually the agent events which pose the largest kind of technological challenge for us. Because for every process, you could have hundreds or thousands or tens of thousands of syscalls associated with it, or every time maybe your process forks or opens a connection, you know, there's multiple system calls associated.
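The file integrity monitoring idea from a moment ago can be reduced to a minimal sketch: record a baseline hash for each watched file and re-hash later to spot changes. This naive polling model is our assumption for illustration; the real agent hooks kernel-level file events instead:

```python
import hashlib
import tempfile

# Minimal file-integrity-monitoring sketch: baseline-hash watched files,
# then re-hash to detect tampering.
def fingerprint(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def scan(watched, baseline):
    """Return the watched paths whose contents no longer match baseline."""
    return [p for p in watched if fingerprint(p) != baseline[p]]

# Demo on a throwaway "config file" holding a secret.
with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as f:
    f.write("db_password=hunter2\n")
    path = f.name

baseline = {path: fingerprint(path)}
print(scan([path], baseline))      # untouched: nothing reported

with open(path, "a") as f:
    f.write("malicious edit\n")    # simulate tampering
changed = scan([path], baseline)
print(changed)                     # the modified file is reported
```

Polling like this misses who touched the file; the syscall-level approach described above also captures the accessing process, which is the part logs alone can't give you.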
And so the ultimate goal in the very early days, and it continues on to this day at ThreatStack, is to always be consuming this data, because they're all potentially security events, and to process it, you know, as fast as possible. You know, at a high level, the agent will send this data into ThreatStack, and we run on Amazon Web Services. We're very big fans of using AWS. And, you know, the data comes into our environment. And, of course, the evolution of our platform has changed over time. You know, at a real high level, in the past we used to use a lot more of RabbitMQ.
So if you kind of rewind back to 2014 when I started, you know, we were bringing in RabbitMQ essentially as a way to replace ZeroMQ, which was the original, like, messaging layer of kind of the ThreatStack platform. We ran into a lot of issues with ZeroMQ at the scale, even the scale we were at in the very early days. And so, RabbitMQ at the time, myself and our CTO at the time, you know, we had a lot of historical experience with RabbitMQ. And scaling RabbitMQ was something I wasn't really too worried about. So we decided to bring that in. A lot of times people are always saying, oh, why not Kafka? Well, 2014 was a different time. I don't even know what version Kafka was at. But to be honest, I wasn't willing to bet a company on something that new.
But again, fast forward to 2018, and now we have multiple Kafka clusters that we're actually bringing in place that are replacing a lot of the pieces of kind of what we would call our old RabbitMQ infrastructure with Kafka topics, and then we can have lots of different processes, or, you know, microservices, I think, is the new fancy term. That is the new fancy term. Yeah. So you have these microservices that are, you know, accessing all this data in Kafka, and then analyzing it, and doing different roll ups and things like that using some tools like Spark. I think more recently, we've been using tools like Flink for kind of rolling up data, or doing, I don't know if you'd call it, like, batch processing.
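The roll-ups mentioned here, which in production would be Spark or Flink jobs over Kafka topics, can be illustrated with a toy tumbling-window aggregation. The event shape, the window size, and the event kinds are all assumptions for illustration:

```python
from collections import Counter

# Toy version of a streaming roll-up: bucket events into fixed
# (tumbling) time windows and count occurrences per event type.
# Timestamps are epoch seconds; (window_start, kind) keys the counts.
def rollup(events, window_secs=60):
    out = Counter()
    for ts, kind in events:
        window_start = ts - ts % window_secs
        out[(window_start, kind)] += 1
    return dict(out)

events = [(0, "exec"), (12, "exec"), (30, "connect"), (61, "exec")]
summary = rollup(events)
print(summary)
```

A Flink job would do the same grouping continuously and fault-tolerantly across partitions; the shape of the computation, window key plus per-key aggregate, is what carries over.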
[00:14:21] Unknown:
I mean, the roll ups. And then
[00:14:24] Unknown:
under the assumption, let's say, one of these security events triggers an alert, you know, then we can tie into third party services, like Slack, for example, or HipChat, or even, like, PagerDuty. We could send a message to PagerDuty and alert you if, let's say, someone installs, like, a kernel module in your environment, which, that's not good. Usually not a good thing.
[00:14:46] Unknown:
Unless you're doing embedded development, I guess, but
[00:14:51] Unknown:
And there are a lot of different topics out of that that we can go into. And to start with, I'm wondering how you ensure that the event data that you're being delivered has a consistent format, so that it's easy to manage as it traverses your various pipelines, and so that you can ensure that you have the necessary information from source to destination?
[00:15:15] Unknown:
Yeah. Yep. You know, our first step when we take this data in is to basically make sure that it matches our expectation for what this data is. Right? So we have a service whose job is to make sure that the data is in the right format.
[00:15:34] Unknown:
Yeah. I guess the best part about having our own agent versus, like, auditd: a lot of times people said, oh, well, it sounds a lot like auditd. And at a really high level, many years ago, our agent looked similar to auditd, except it was built to be much more performant. The biggest difference, though, especially in those early days, and it's much different now because we're doing a lot more: if you've ever looked at, like, auditd log formats, they're multiline, they come in out of order. It's just very difficult to parse in many ways. You have to correlate multiple log lines in order to get the full concept of an event. Right?
With our agent, we actually take this data and store it as a JSON object, and store things like the process name, the port number if it was a connection, right, or, like, an IP address. And like Pat was saying, we'll have these services that can not only ensure that, like, the data is in the format we expect, but, like, also do kind of, like, data sanitization. We ran into weird issues where, the Linux kernel, like, we just take what's coming off the Linux kernel, it's actually the audit APIs we interface with, but the Linux kernel is, I don't know, not great. Well, there's a lot of cooks in that kitchen. There's a lot of cooks in that kitchen. That's a good way to put it. So we've seen scenarios where you might get a process ID that's completely out of range, or a negative number, from the Linux kernel. And so, you know, how do you deal with that? Right? So we have to do some, like, data sanitization to make sure that further on down, you know, down other data pipelines, like, we don't run into issues with that. Like, expecting a process ID within some, I guess, normal range and have it come back to us as, like, Unicode or something like that. Yeah. Or,
[00:17:27] Unknown:
you know, the source port, you know, for example, making sure that it is actually a valid port from 0 to 65535
[00:17:34] Unknown:
versus, like, negative 9. Well, actually, that leads me into a hilarious story of when we first launched our, we called it our limited availability. It was, like, the month before the public availability launch. I was actually not there for that launch; my son was being born at the time. So I basically, at, like, 4 AM, went to the hospital with my wife and said, good luck, I'll see you in a couple weeks. And when they launched the limited availability, some data getting sent from an agent was completely malformed, and they had to track it down to find out what agent it was. It turned out actually to be my test agent that was running at the time on an IRC bouncer, and it was doing exactly what Pat said, which was it was sending, like, a port number of, you know, like, I don't know, 30,000,000 or something. Like, some insane number that in no way is real. It's like one of those IPv8 ports. IPv8.
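The sanitization pass described here, catching PIDs and ports the kernel occasionally hands back out of range, might look something like this sketch. Field names and the exact limits are illustrative assumptions, not ThreatStack's actual schema:

```python
# Sketch of event sanitization: flag fields outside sane ranges before
# the event enters downstream pipelines.
def sanitize(event):
    problems = []
    pid = event.get("pid")
    # Linux pids are positive and bounded by pid_max (at most 2**22).
    if not isinstance(pid, int) or not (0 < pid <= 2**22):
        problems.append("pid out of range")
    port = event.get("port")
    # TCP/UDP ports occupy 16 bits: 0..65535.
    if port is not None and not (isinstance(port, int) and 0 <= port <= 65535):
        problems.append("port out of range")
    return problems

good = {"pid": 4242, "port": 443, "exe": "/usr/bin/curl"}
bad = {"pid": -9, "port": 30_000_000}   # the kinds of values described above
print(sanitize(good))   # clean event: no problems
print(sanitize(bad))    # both fields flagged
```

Whether a flagged event gets clamped, dropped, or quarantined is a policy decision; the point is that the check happens once, at ingest, so nothing downstream has to re-validate.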
[00:18:28] Unknown:
Yeah. I think we're in a very unique position; like Pete said earlier, you know, we do have our own agent. So it's not that we wouldn't do sanitization, because that scares me as a security person. It's just that we get to be more strict about what we accept in the first place, which is a nice place to be in. And
[00:18:48] Unknown:
playing again off of your story of having to track down where the particular data was coming from, what mechanisms do you have in place to ensure the provenance of the data that you are collecting and ensure that the chain along which it's being delivered is verifiable so that you don't have anybody maliciously injecting false data into your system?
[00:19:12] Unknown:
Yeah. So there's a lot there. A lot of the stuff around agent auth could be a whole other podcast. But we do have, you know, an authentication structure within our agent that makes it so that both sides have to compute something to figure out that the data is associated with that agent ID, and that's a big part. But also, you know, as we go down and we build services out, you know, security has been part of our organization since the start. And we've only made more of a commitment to that, you know, as we've built out our security team. So, you know, we have a lot of work that goes into, you know, how do all these microservices talk to each other and ensure that, you know, the requesting organization and the requesting user make sense for the data that's coming back.
And so, you know, we do a lot of work in making that story performant. You know? It's an important thing because, you know, obviously, it would be a very bad thing if one, you know, user's data went to another user. Right? That is a pretty big failure. And so, yeah, we've put a lot of stuff in code to basically make sure that, you know, the requesting organization, the requesting user make sense for the data that comes back, and we check it on the way back out too, which is kinda neat. Yeah. I guess ensuring that,
[00:20:49] Unknown:
the data coming from a series of agents, so if you think of it like associating, you have a customer, they'll have an organization; that's kind of a top level idea at ThreatStack, where you'll have an organization. What's actually very cool about how we've designed the product is that you might have maybe a large company with a lot of different organizations with different pieces of infrastructure. Let's just think at a very basic level: you might have a dev environment and a prod environment. The alerts and the rules that you define within the prod environment might actually be different than dev, but maybe you run ThreatStack across both, which a lot of our customers do. And so you could kind of have one customer with multiple organizations. But we do a lot of interesting things on the encryption side as well, to ensure that, as data is landing in our environment, it gets encrypted with kind of this, like, organizational key.
Not even, like, an account, like a customer specific one; it's really down to the organizational level. What's really interesting about that as well is that we can ensure that, once a customer's no longer with us, we can just shred keys as a way of removing any sort of access to that data. Yep. And
[00:21:57] Unknown:
going deeper on the idea of having your agents deployed across multiple environments within a given organization, what types of contextual data or additional formatting or even rate limiting of the data do you provide to the end user for being able to manipulate how the information is generated and what additional context is contained within the events that get delivered?
[00:22:23] Unknown:
Yeah. I mean, so we don't really focus on that, you know; we wanna be able to capture all of those events. Right? Because if there's too much control on the user side, it would be trivial for an adversary to go in and say, like, great, don't send any events.
[00:22:40] Unknown:
Like, Yeah. We really try to we treat the kernel as the source of truth. So if it comes off the kernel, then the goal is to essentially get it to our infrastructure without modification.
[00:22:52] Unknown:
Yeah. Now that said, you know, there are ways in which, you know, if we know, and a customer reaches out to us and says, like, hey, you know, this particular command is kind of sensitive, or, like, the inputs to this command will be sensitive, then that's something we work on on a case by case basis. But that's actually a really good point, which is, you know, you might have a customer who says,
[00:23:17] Unknown:
I have some processes and, you know,
[00:23:20] Unknown:
PCI related or something. And, you know, I don't know. Like, you you Well, HIPAA is the 1 that comes to mind. Right? Like, you're you're processing data in some sort of, you know Or like an argument.
[00:23:31] Unknown:
Like An argument is the patient name. Right? Yeah. So, like, let's say you have a command that runs. We capture not only the command, but all the arguments associated with it. You'd wanna know, like, what flags maybe were passed to a command, or if /bin/bash was executed with some other service. Right? That would kinda be an argument. And so some customers say, oh, well, the way that this application works is we pass essentially PII as the arguments. They ask us, like, well, how do we ensure this data actually never leaves our environment? And so we actually have some kind of filtering technology on the agent side where customers could kind of define
[00:24:04] Unknown:
what is PII for their environment, to kind of remove the likelihood of leaking kind of sensitive data. Yeah. But that said, you know, that's a very small amount of our customers. Most of our customers want that full visibility and just want everything coming off the kernel to show up in their interface. Yeah. We often refer to it as a fire hose. It truly is,
[00:24:26] Unknown:
a fire hose of data in many ways.
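Agent-side argument filtering of the kind just described might be sketched as a pattern-based redaction pass, where the customer supplies what counts as PII in their environment. The patterns, flag names, and `[REDACTED]` marker are all hypothetical:

```python
import re

# Sketch of agent-side argument scrubbing: arguments matching any
# customer-supplied PII pattern are redacted before the event leaves
# the host; everything else passes through untouched.
def scrub_args(args, patterns):
    return ["[REDACTED]" if any(p.search(a) for p in patterns) else a
            for a in args]

# Hypothetical customer config: a HIPAA-sensitive flag carrying a name.
patterns = [re.compile(r"--patient-name=")]

cmd = ["billing-report", "--patient-name=Jane Doe", "--format=pdf"]
print(scrub_args(cmd, patterns))
```

Redacting on the host, before transmission, is what makes the guarantee meaningful: the sensitive value never leaves the customer's environment, rather than being filtered server-side after the fact.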
[00:24:29] Unknown:
And with how much volume of data you're working with, and the various different pieces of how that data is flowing through your system, how do you ensure an appropriate level of consistency and scalability in the actual underlying infrastructure that you're deploying, and ensure that the systems are operating as intended?
[00:24:51] Unknown:
Yeah. That's a great question. There's an animated GIF that I use often when I talk about how to scale on the cloud. It's a picture of, I don't even know what cartoon it's from, but it's like the old lady from, like, Tweety Bird, I think, throwing money into a fire. Just, like, money onto the fire. And I use it often. So I guess one benefit of using the cloud is getting more cloud is usually not too hard. Of course, there's instance limits you have to watch out for. But, you know, if you have money, Amazon will happily take your money for more cloud. So that's obviously one big place. We scale up hundreds upon hundreds of nodes maybe to handle scale events, and scale back down when we're done processing.
You know? But for the most part, as data is coming into our environment, you know, probably the largest databases that we use are, you know, open source databases like Cassandra or Elasticsearch. Cassandra is great because it allows us to do quorum writes, which, from a data durability standpoint, means we can run across multiple availability zones or multiple regions to ensure kind of the durability of the data. So we won't acknowledge a write of data unless it has been written to multiple Cassandra racks, is the term. Right? And so a rack for us might be, like, another AZ or another region.
Elasticsearch has something similar, where you can do fun things around kind of primary and replica shards, where you can say, like, a replica shard can't live in the same rack as a primary. So we essentially do the same thing. And that's been a big help. Kafka too, as well. Actually, one of the biggest benefits of Kafka is that ability to do quorum writes, where you don't acknowledge a write into a Kafka cluster unless it has been written to multiple, again, same idea, racks,
[00:26:45] Unknown:
availability zones, regions, things like that. And on the subject of your migration from RabbitMQ to Kafka, I'm wondering what you have found to be the benefits, whether it's in terms of cost savings, either time or monetary or simplification of the overall deployment or management of the actual infrastructure or any other tertiary benefits?
[00:27:11] Unknown:
Sweet, sweet durability. Yeah. RabbitMQ
[00:27:14] Unknown:
and I say this as someone who's been using RabbitMQ for, I don't know, almost a decade, for a very long time. So RabbitMQ is something I have a lot of comfort in running. If you actually rewind back to when I was at a company called Sonian, RabbitMQ was one of the core pieces of infrastructure, to the point that when we built out Sensu, the open source monitoring platform, we used RabbitMQ for the transport of essentially monitoring events. And if you are a Sensu user, you know, RabbitMQ is still an option, until the new version of Sensu comes out. And so when we brought on RabbitMQ initially, you know, some of the challenges of RabbitMQ scale and durability come down to just how to do replicated queues.
And they've done a lot of improvement around there, but it still kind of suffers from maybe the same issue as Elasticsearch, where you don't have a notion of quorum writes, where you can't ensure writes across multiple places in the same way that services like Kafka and Cassandra do. And maybe that's changed; maybe it's better now. Honestly, I haven't looked too recently at the more bleeding edge versions of RabbitMQ. For us, the main reason for moving was essentially data durability.
We wanted to have a higher degree of confidence that the writes into Kafka were durable. But one of the actually interesting parts of the original RabbitMQ architecture was the way in which we ensured that we could scale it: we basically created this concept of pods. And this is before Kubernetes was around, so we're like the hipsters, I guess, of the term pod. But we have an internal term of a pod, which is a RabbitMQ with a series of services wrapped around it. And those turn into our individual scale points, where if we need to run more scale than a single RabbitMQ can handle, because there are gonna be upper bounds just based on the size of the instance, we can basically duplicate these pods, and we could run 5 pods or 10 pods. And we can span these pods across availability zones as a way of essentially writing bits across multiple places.
But the ultimate reason we decided to move off of it was that we were building out new infrastructure a couple years ago, essentially how we manage alerting our customers when a security event occurs. And you can imagine that sending an alert that a security event occurred, or that a specific system call was an anomaly, is of high priority. And the durability needs of that were much greater than, honestly, we believed we could have gotten out of RabbitMQ at the time. So that was a big reason to move. Now, the benefits have been that it is significantly more performant than RabbitMQ for, again, our load type.
We actually ran into a lot of issues, which, again, came mostly from how we were using RabbitMQ. But under certain kinds of scenarios we felt a lot of pain, and talking with other users of RabbitMQ at certain levels of scale, they've shared similar thoughts. When you have a very loud producer, so you have the concept of a producer, something that puts data onto RabbitMQ, a queue or an exchange that writes to a queue, and then a consumer. Under scenarios where you have a lot of data, the consumer could actually get blocked on the producer's writes. And so how this manifests is you're writing a lot of data onto an exchange.
And the consumer on the queue that might be part of that exchange is actually unable to keep up with processing data. The only way to solve this was some magic with our service discovery, where we essentially move an entire pod out of commission so that it can catch up, and we might bring net new pods into our infrastructure. Kafka has yet to run into that same issue because of its write patterns and read patterns. Obviously, there are totally different load patterns and scale intricacies of Kafka that I'm sure a lot of people talk about more recently. But, so far, and I'm a technology hater.
I feel like there's a lot of stuff out there, and, you know, on a long enough timeline, all software is terrible. But I'm actually really pleasantly surprised by Kafka and its improvement for our use case, at least, of processing what's essentially a stream of time series data.
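An editor's aside on why the blocking problem described above goes away: a Kafka-style append-only log decouples producers from consumers, because each consumer tracks its own offset and a slow consumer simply lags rather than exerting back-pressure on writes. A toy sketch, not real Kafka internals:

```python
class Log:
    """Toy Kafka-style append-only log: producers append, each consumer
    tracks its own offset, so a slow consumer lags but never blocks writes."""
    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer name -> next offset to read

    def produce(self, record):
        self.records.append(record)  # never blocked by consumers

    def consume(self, consumer, max_records=1):
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch

log = Log()
for i in range(1000):                           # a "very loud producer"
    log.produce({"event": i})

slow = log.consume("alerter", max_records=10)   # reads 10, lags by 990
print(len(slow), len(log.records) - log.offsets["alerter"])  # 10 990
```

In a RabbitMQ-style shared queue, that same imbalance can trigger flow control on the producer side; here the producer's `produce` path never depends on consumer progress.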
[00:32:04] Unknown:
And in terms of the analysis and determining when to alert your users, I'm wondering how you determine when something is a potential threat or whether you have any sources of, data that you use to create these signatures of what an attack might look like, particularly given the continually evolving landscape of security threats and new types of attacks?
[00:32:31] Unknown:
So we don't use signatures, and that's always been something that we talk about a lot: we're essentially behavior based. Right? Like identifying maybe risky behaviors that could lead to a breach, the goal being that you can identify it before it happens. But, yeah, you're right, there's definitely an evolving landscape of threats out there, advanced persistent threats, or There's some mediocre persistent threats too.
[00:32:59] Unknown:
Yeah. So it's really based on a set of rules built on some of the data that we'll get out of auditing. Right? So you could write a rule that, let's say you set up a honeypot. Right? So there's a machine that has a user with a known weak password. You could set up an alert in ThreatStack that says anything that this user does, just alert me on it. Right? So we ship a bunch of alerts out of the box, and then we work with you to customize what makes sense for your environment and,
[00:33:41] Unknown:
you know, how does your workload run? That's actually a really good point, which is, for some companies, running NGINX, for example, is normal. Right? That's maybe the web server or reverse proxy they're using. But for other companies, if they saw NGINX running, that could actually indicate a breach or a threat. Right? For some companies, running Netcat is not great. But Netcat is a really powerful tool for debugging network connections. So one of my favorite rules,
[00:34:14] Unknown:
that I'm kinda writing right now is, like, certain classes of servers should not be running GCC, like, ever. Right? Yeah. That would be an indicator of compromise to me. Right? So for any of the compiler tools on certain servers, that should just be a sev 1, because that means somebody is maybe trying to do a proof of concept exploit or something like that. So I kind of focus on the rule sets as: these are things that I don't want to see happen in my environment, or I know that they happen and I want to suppress certain types of them, and everything else I'm concerned about. So it's not signature based, which is nice.
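At their core, rules like the ones described here are just predicates over audit events: "alert if this process appears on this class of server." A hedged toy version in Python, purely illustrative and not ThreatStack's actual rule language:

```python
# Toy behavior rules in the spirit of those described above
# (illustrative only; the event fields and rule names are made up).
RULES = [
    {"name": "compiler-on-prod",
     "match": lambda e: e.get("server_class") == "prod-web"
                        and e.get("exe") in {"gcc", "cc", "make"}},
    {"name": "privilege-escalation",
     "match": lambda e: e.get("exe") == "sudo"},
    {"name": "netcat-usage",
     "match": lambda e: e.get("exe") in {"nc", "ncat", "netcat"}},
]

def evaluate(event):
    """Return the names of all rules an audit event trips."""
    return [r["name"] for r in RULES if r["match"](event)]

print(evaluate({"server_class": "prod-web", "exe": "gcc", "user": "www-data"}))
# ['compiler-on-prod']
print(evaluate({"server_class": "ci", "exe": "sudo", "user": "sam"}))
# ['privilege-escalation']
```

The interesting design point, per the discussion, is that the rule set is per-customer: "NGINX is running" is a match for one environment and background noise for another.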
It's just based on, you know, we have some basic rules, for tools that are,
[00:35:05] Unknown:
typically associated with exploits and stuff like that. But also a really basic rule: if you're trying to identify data exfiltration, again, Linux is what we're talking about, there are usually only so many ways to get data off of a Linux server. And so if we basically look for those processes, and we start seeing usage of the various tools that could exfiltrate data, we can Or even just connections outside
[00:35:30] Unknown:
my, like, IP range. Exactly. Outside the LAN. Right? Like, that's probably bad too. So there's a whole way that you can start to customize and think about these things. One of my favorite ones, and the one that I use as my biggest input to the security program, is sudo. I wanna see when people are doing privilege escalation. I wanna see what they're doing, when they're doing it, why they're doing it. And I alert on those because I want to find ways to reduce that. Right? So it turns into a good way of identifying
[00:36:02] Unknown:
potential manual events happening in your environment. You know, we did a webinar with a customer quite a while ago, and the way they actually used ThreatStack was interesting. It was a company that had been around for a very long time, and just like most companies, you have turnover; the people who maybe created the thing are no longer there. And it allowed them to bubble up this visibility around operational manual tasks that were going on in their environment, like a user edits a file, restarts a service. They don't tell anyone they did it because it's something they've been doing forever. But it's happening, it's not visible, and no one knows about it. It's the term, I think, that was popularized in the Google SRE book: toil. Right? It's a thing you spend time on that is not moving the business forward. Right? So identifying those scenarios as well is pretty helpful. Well, and you think about, is ThreatStack finding all of these threats across all these customers? And
[00:37:02] Unknown:
yes. But I think the more interesting output of ThreatStack is: what are people doing with infrastructure that maybe we need to optimize for? Right? Like, companies have said, oh, we have a CI system, we know that our code is being deployed in this particular way. And then they install ThreatStack, and it's like, well, actually, Sam logs in and scp's a file over to deploy that app, because the CI system's been broken for x amount of weeks. Right? So it's that visibility, and it's not necessarily vulnerabilities.
It's just breakdowns in process. And I have this whole thing that I've thought about for a while, which is that good operations tends to lead to good security. Right? Like, if you're doing all of the good operations stuff, you have made it sufficiently more difficult for an adversary to do something bad. Right? If you're doing regular patching, if you're limiting sudo access, you're probably in a better spot than a large percentage of other people, and adversaries are gonna go for an easy target.
[00:38:12] Unknown:
Yeah. It's kind of an old world, this is very nineties or early aughts, right, of you get root. It was all about getting root. That was the old school hacker days, because if you got root on a host, you probably got the mail server, file server, DNS server. If you got root on one server, you probably had access to a lot of stuff. In the current model, and in a lot of ways the future model, this gets more challenging on both sides. You have tools like Kubernetes where you're scheduling processes to run across a series of hosts.
And so which host, which container, is doing which file system activities or network connections and things like that? And it ran on host A last second, but then it got scheduled again and ran on host B. So how do you get visibility into that world? It's something that's always fun at ThreatStack, because we're trying to see how people do their operations in order to help make it easier to get that visibility. Like, what is actually going on? Who is doing these things, and how can we maybe not do that anymore? That's my job.
[00:39:25] Unknown:
And what are some of the most common types of vulnerabilities that you've noticed in customer environments?
[00:39:32] Unknown:
I kinda go back to what we just said, you know, it's less vulnerabilities, though I have seen the example that I gave at the beginning, where it was kinda like, yeah, your web server is running bash all of a sudden and running ps aux. Right? Like, that seems weird. So that kind of comes to my mind naturally because it's something that I've seen before. But also, it's mostly the breakdown of people and process. It's not necessarily one single vulnerability, or, I know, I look at all the data and say something like, oh, a bunch of people are running this version of Linux. That's not quite what I focus on. I'm more interested in the behaviors of users and how they're using infrastructure.
It's interesting too, because the most likely place you're gonna get hacked is probably via email.
[00:40:36] Unknown:
And the outcome of that is probably someone paying an invoice to someone they shouldn't pay an invoice to. Actually, a friend of mine just told me the other day about how they basically got breached, I guess you'd call it a breach, where someone paid an invoice that was from a scammer. So those, I think, are pretty often the case. What I think is interesting, especially in this cloud world, is you hear a lot of people talk about immutable infrastructure. You know, the great and glorious world where you prebuild your images in advance, and you deploy them, and they never change in production.
It's that never-change-in-production part that I always find to be kind of funny. And I actually had a conversation with some operations folks about that one. I was spending some time doing some testing around in-place kernel updates. There were, as many people know, vulnerabilities in the Linux kernel, and we wanted to patch those vulnerabilities by updating the kernel. And I was spending a good amount of time testing this because we have a lot of databases, and these databases have a lot of data. There's a cost to everything in Amazon. So it is not financially viable to say, I'm gonna roll up an entirely new Cassandra cluster and copy over a few hundred terabytes of data, because there's a cost to running 2 clusters, there's a cost to copying that data. It all costs money, and time as well.
But, you know, the response from this engineer was, oh, well, at my previous company we just used to build AMIs with the new version of everything. And I was like, yeah, we do that here. And for systems that are easy to roll, like stateless systems, roll a new host in, roll the old host out, call it a day. But the thing that I worry about most often is that people build those AMIs with the latest bits, especially cloud native type companies. They deploy them out, and they run for a period of time. And maybe they're not doing something just as simple as enabling nightly security updates, which then take the latest update.
And there's 2 schools of thought on something as simple as nightly security updates. On one hand, you wanna deploy updated versions of your OS software at the same time as you do a software deploy, which is great: test it in dev, push it all in one batch. I think that's a great idea. But the other way is that I would rather take the latest version of whatever is coming out in, specifically, let's say, Ubuntu security updates or CentOS security updates. I'd rather take the most recent version of all of those whenever they're available, just to try to stay one step ahead and really just not have to think about, what is my patch management strategy?
I can let kind of the OS take care of it.
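For reference, on Ubuntu the nightly-security-updates approach described here is typically done with the unattended-upgrades package. A minimal configuration, using the file paths and keys from the stock Ubuntu packaging (shown as a hedged example, not anyone's actual config), looks like:

```
# /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";

# /etc/apt/apt.conf.d/50unattended-upgrades (excerpt)
Unattended-Upgrade::Allowed-Origins {
        "${distro_id}:${distro_codename}-security";
};
```

Restricting `Allowed-Origins` to the `-security` pocket is what makes this "security updates only" rather than pulling every new package version.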
[00:43:19] Unknown:
And in terms of the data infrastructure, or the anomaly detection mechanisms, or even just the business aspects of growing and managing the business, what have you found to be the most challenging aspects, whether in the past, currently, or that you're facing going forward?
[00:43:38] Unknown:
Yeah. So I'll give mine. I think the most challenging part, honestly, is really the people side. For any company growing and scaling, the hardest part is always finding people, and finding the right people at the right time with the right experience. The challenge that we're dealing with is different in a lot of ways. And so we often tell people when they come on board at ThreatStack, it takes a good 3 to 6 months to really fully ramp up and get up to speed. And so often people are like, oh, yeah, I'll be up to speed in no time. And then they start seeing the scope of data and the amount of data, and the fact that the data grows exponentially every week. Every week, there's more customers and there's more agents and there's more data. So you're battling both the inherent change in scale as well as, well, there's some new features we wanna push, things like that. Although Pat has probably the best example, which is, you know, have you ever recreated your software before? Right? Oh, yeah.
[00:44:39] Unknown:
Yeah. I mean, as we've grown, one of the things that we had to do is figure out how do we monitor ourselves with our own tool. And we were doing this for a while in a way that I would describe as not optimal. So one of the things that we did was rebuild ThreatStack in a new AWS account so that we could use it for monitoring, for testing, for kind of a QA environment, but also, with my security head on, so that I have an environment that's stable, where I can understand what's going on inside the company. Right? So rebuilding ThreatStack from scratch,
[00:45:18] Unknown:
which is funny too, because, for example, our development environment is hundreds of servers. And so when you think of yourself as a large SaaS platform, a hosted platform for customers, could you essentially rebuild that from scratch? Because oftentimes you build the first thing and it evolves over time, maybe over many years. So, like, could you go and start from scratch and rebuild it all again? Right? It was a very interesting exercise. It sure was.
And a lot of fun as well. I mean, it's it's always fun to go through those, go through those processes.
[00:45:57] Unknown:
And what are some of the projects that you have planned going forward to improve your overall capacity or the feature set or capabilities in your infrastructure and in your product?
[00:46:10] Unknown:
Yeah. No. It's a good question. So we're constantly looking at how we store and manage the wealth of data we have across different databases. We use a lot of different databases, and it's not a bad thing, because Postgres might be very good at handling some types of data, but Cassandra might be very good at handling other types. We've been playing around with databases like Druid, which we've heard a lot of good things about, for handling and processing data. Flink is the newest thing we've been toying around with for how we process and manage data.
But then, essentially, it's very fun because we get to try things out very safely. We do a concept called dark shipping, where we might dark ship a new database or a new feature, where we can put it under production load and see how it reacts, and run queries against it in ways that are not impactful to our customers. And we can see, well, the queries in this database take a few hundred milliseconds, that's perfect, but in this other database it's thousands of milliseconds, maybe that's not ideal for us. So we're always looking at different ways to store data, and to be able to do it cost effectively. Whether that means, is S3 a great place to store data that is maybe infrequently accessed?
Or, for things that are very infrequently accessed, things like Glacier make it cost effective to store a lot of data. But, yeah, for the most part, the nice thing is we're at a size where, the term I usually like to use is, we're just moving the ball down the field a little bit, with these slow, continual improvements in how we process data and what the infrastructure looks like. Kafka has been a great success over the last year; we're gonna be bringing in much more of it, because we like the durability and the failure patterns of it, and we're investing more time there. And then, even on the operational side, improving some things like service discovery.
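An editor's aside: the dark-shipping pattern described above, serving every query from the existing store while mirroring it to the candidate store and recording only the candidate's latency, can be sketched like this. The datastore interfaces here are hypothetical stand-ins, not ThreatStack's code:

```python
import time

def dark_ship_query(primary, candidate, query, candidate_latencies):
    """Serve the result from the primary store; mirror the query to the
    dark-shipped candidate, recording its latency but ignoring its results."""
    result = primary(query)                  # customers only ever see this
    start = time.perf_counter()
    try:
        candidate(query)                     # shadow query under real load
        candidate_latencies.append(time.perf_counter() - start)
    except Exception:
        pass                                 # dark-ship failures never hurt users
    return result

# Hypothetical stand-ins for the old and new databases:
old_db = lambda q: f"rows for {q}"
new_db = lambda q: f"rows for {q}"
latencies = []
print(dark_ship_query(old_db, new_db, "select ...", latencies))  # rows for select ...
```

The key property, reflected in the `try/except`, is that the candidate path is observe-only: it can be slow or broken without any customer-visible effect, which is what makes the comparison safe to run in production.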
We use Chef for configuration management, and as many people in the Chef community do, you end up using Chef search, using the Chef server as a source of truth. We've been toying around with tools like Consul in order to store services and discover them. And even tools like Terraform. Terraform has a good community around it, and we've decided to spend some time migrating from CloudFormation, which is great because it can help you build stuff on Amazon, but Terraform can actually help you build things on more than just Amazon, which is nice for getting that stuff into code somewhere. So it's really just this interesting, slow, fun, iterative improvement on how we get work done, for lack of a better word.
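As one concrete illustration of the Consul approach mentioned here, moving off Chef search usually means registering each service with a small definition that Consul then serves via DNS and its API. The service name and port below are made up for the example:

```json
{
  "service": {
    "name": "rabbitmq-pod-1",
    "port": 5672,
    "check": {
      "tcp": "localhost:5672",
      "interval": "10s"
    }
  }
}
```

The health check is what makes this stronger than querying the Chef server: only nodes currently passing the check show up in discovery results.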
[00:49:07] Unknown:
And are there any other topics that we should discuss before we start to close out the show? That's a good question. I don't know. We went through a lot. Yeah. I can't think of anything. So for anybody who wants to follow the work that you're both up to, get in touch, or keep up to date with ThreatStack, I'll have you add your preferred contact information to the show notes. And as a final question, I'm wondering if you can share your perspective on what you view as the biggest gap in the current tooling or technology for data management.
[00:49:37] Unknown:
Yeah. I got one for this one. So I really like Kafka. I really like how reliable it is. I like everything about it except managing it. When I think of the command line tooling for, you know, reassigning partitions, it's a lot of: pipe out this JSON and then edit this JSON.
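For context, the JSON-wrangling referred to here is Kafka's `kafka-reassign-partitions.sh` workflow: you `--generate` a candidate plan, hand-edit it, then `--execute` it. The topic name and broker ids below are illustrative, but the file shape is the tool's standard format:

```json
{
  "version": 1,
  "partitions": [
    {"topic": "agent-events", "partition": 0, "replicas": [1, 2, 3]},
    {"topic": "agent-events", "partition": 1, "replicas": [2, 3, 4]}
  ]
}
```

Editing raw replica lists by hand for hundreds of partitions is exactly the operational pain being described.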
[00:49:58] Unknown:
And that's actually a really good point, which is that the biggest gap is many databases are, I guess, designed operationally poorly. Good examples: Spark involves shuffling bits around and having host files set up, and operationally it's not fun to manage. Cassandra equally so, in the ways you have to define the ring: it's defined via IP address, which, if you control your IP addresses in your own data center, then great. But if you're on the cloud, and this is pre Elastic Network Interface, you may not control your IP addresses. So how do you manage that? But, yeah, Pat's got a great point, which is there are databases that I would say are very operationally friendly because they have fantastic APIs. Like, Elasticsearch has great APIs.
And it makes it, at least from my perspective as a long time user of Elasticsearch, trivial to manage. But as a new person coming into Cassandra, when I started at ThreatStack in 2014, and now Kafka, these are interesting in that they really lack good tools around them, at least at this point. So if any one of your listeners out there is building some new software for managing data, either build really good APIs so other people can create tools, or basically build good tools on top of it for people to actually manage it with. Yeah.
[00:51:29] Unknown:
Alright. Well, thank the both of you for taking the time out of your day to join me and talk about the work you're doing at ThreatStack. It's definitely an interesting problem space and an interesting solution. So I appreciate your time, and I hope you enjoy the rest of your day. Yeah. Thank you. You too, man. Yeah. Thanks for having us. This is awesome.
Introduction to Guests and Their Roles
Journey into Data Management
ThreatStack's Platform and Differentiators
Comparison with Traditional Network Tools
Data Collection and Processing Pipelines
Ensuring Data Consistency and Security
Contextual Data and User Controls
Scalability and Infrastructure Management
Migration from RabbitMQ to Kafka
Behavior-Based Threat Detection
Common Vulnerabilities and Operational Security
Challenges in Data Management and Security
Future Projects and Improvements
Gaps in Current Data Management Tooling
Closing Remarks