Making Data Collection In Your Code Easy With Rookout

Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.

Your host is Tobias Macey. And today, I'm interviewing Liran Haimovitch, CTO of Rookout, about the business value of operations metrics and other dark data in your organization. So, Liran, can you start by introducing yourself?

Hey, Tobias. It's great to be here. My name is Liran Haimovitch. I'm Rookout's cofounder and CTO.

I've spent most of my career doing cybersecurity.

And for the last three and a half years, I've been the cofounder and CTO of Rookout, where we deal with data extraction, data fetching, and, in general, how we can easily and agilely get the data we need on the fly.

And do you remember how you first got involved in the area of data management?

So about 3 and a half years ago, my cofounder and I wanted to found a startup, something in the area of DevOps and DevTooling.

And as we dug deeper into it, we were fascinated by how hard it was to collect data from our own software. I mean, it's our code, it's running on our servers, and yet it could be ridiculously hard to get a piece of data from it: a stack trace, a variable value. And as we dug deeper into it with our cybersecurity experience, we came up with a novel way to extract the data on the fly without going through the traditional process of embedding the data extraction code into our application code.

And so for the types of data that we're talking about, obviously, in your application, some of the types of things you're looking for are log data or system metrics. But what are some of the other ways that that data manifests and the types of information that you might be looking to extract?

So that data manifests in a very, very broad spectrum. As you mentioned, logs and metrics tend to be somewhere mid-spectrum. They are technical, but not super deep. On one end of the spectrum, we're seeing business metrics, which are logins, sign-ups, revenue generated, and so on. And we're seeing those metrics growing in importance as software in general is becoming a bigger part of the business, and so are the metrics required to measure that. And on the other side of the spectrum, we are seeing very, very technical data being extracted, heap dumps, stack traces, and so on, that allow you to truly see the exact state of the application so that you can better understand it.

And obviously, for things like the amount of revenue generated, that's something that people have tracked for a while, usually by dumping the actual sales reports at the end of the quarter, for instance.

So then as far as the types of business questions that can be answered from some of these different types of data that you're collecting, what are some of the most interesting or most pressing ones that you've seen or that some of your customers have been able to ask and answer?

So we are seeing that with the rise of software and digitalization processes, software engagement is becoming ever more important, and the various business stakeholders, be it sales, marketing, customer success, or product, are becoming keener and keener to know exactly what's going on, to know how much money is made, to know which parts of the product are making that money, which parts of the product are creating engagement. And as new features are being rolled out and new market segments are being penetrated, those questions are ever evolving.

As software processes are becoming a bigger part of the business, we're seeing the business stakeholders, sales, marketing, and others, ever more interested in answering questions about how the software is behaving with real-world customers. Those questions tend to be the same and fall into a few broad categories. How many people are signing up and logging into my software? How do people interact with my software, which parts of the software do they interact with, and how often? But the interesting thing is how often those questions change.

And what we're seeing is that as time passes, business stakeholders have to answer more questions, and they keep changing them more frequently. And that's a big difference. You're not just answering the question of how much revenue am I generating per quarter, but every day, every week, you have new stakeholders asking new questions, and you have to iterate on the data you are collecting to answer those questions.

And for being able to collect that additional data, as you mentioned, some of the traditional means are for developer teams to add new points where they're collecting specific metrics in the application. So what are some of the considerations that developers and operations engineers need to be aware of when they are defining those collection points for the types of metrics and log messages that they're collecting? And what do you see as some of the effective strategies for being able to incorporate the different business stakeholders in that process of defining those collection points?

So that's exactly right. The traditional method for extracting data is by writing more instrumentation code to get that data. And there are a bunch of considerations here. The basic, more engineering-oriented questions have to do with how do I make sure it doesn't impact my performance, how do I make sure it doesn't impact my reliability, how do I make sure it doesn't change the program logic, that I'm not accessing a variable that looks like an attribute but is in fact a function? And with those elements, you just have to use traditional techniques: unit testing, QA testing, and so on. But at the same time, there are a whole slew of other considerations that have to do with security, data governance, and compliance, as well as how do I route the data to the right stakeholder in the right format.

And those tend to be trickier requirements to meet; they require a lot of documentation of what is and isn't allowed, and implementing that in the everyday engineering processes.
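To make the "variable that looks like an attribute, but is in fact a function" risk concrete, here is a minimal Python sketch. The `Cart` class, its `total` property, and the `checkout` function are all hypothetical names invented for illustration; they are not taken from Rookout or from the conversation. The point is only that an innocent-looking log line can run code and change program state.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")


class Cart:
    """Hypothetical class: `total` reads like an attribute but runs code."""

    def __init__(self, prices):
        self.prices = list(prices)
        self.recalculations = 0  # counts how often the "attribute" was evaluated

    @property
    def total(self):
        # Side effect: every read bumps a counter. In real code this could
        # just as easily be a database query or a cache invalidation.
        self.recalculations += 1
        return sum(self.prices)


def checkout(cart):
    # Instrumentation line added "just to log a value". The argument is
    # evaluated before the logger is called, so the property runs here.
    log.info("cart total=%s", cart.total)
    return cart.total


cart = Cart([5, 7])
result = checkout(cart)
print(result, cart.recalculations)  # prints "12 2"
```

The property was evaluated twice rather than once, purely because of the added log line. A unit test that asserts on the side effect, as the speaker suggests, is one way to catch this before it ships.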

And then another aspect of the type of data that you're collecting is ensuring that you have the appropriate context, because a single data point isn't necessarily going to be very meaningful unless you have some of the broader context. A single login session might be useful in some cases, but you want to know: has this person ever logged in before? What are they trying to achieve? And so what have you found to be some of the ways of being able to capture some of that context and define it? And then once it is captured, how do you colocate that information with the data after it's been collected and as it traverses different systems?

That's exactly right. One of the challenges we're seeing customers struggle with, and it's often the reason why they have to iterate over and over on data collection, is that you have to get the whole piece of data. It's not enough to get the variable. It's not enough to get the metric. You have to be able to associate it with the relevant user segment, maybe with the relevant server, based on what you're gonna do with the data. And then once you have the data, you have to make sure that you bring it as a whole to rest in the right system, side by side with other pieces of data, so that you can correlate it. And if you end up forgetting a variable or sending it to the wrong system, more often than not, you're gonna have to do another iteration just to get it right.

Another element of the collection of this operations data is that often we're using multiple different systems, where log data might end up going into some hosted log provider or your own Elasticsearch cluster. Metrics might be going into something like Graphite, or you might use Datadog. And then being able to correlate all that information together is either very difficult or potentially impossible. And I'm wondering what you see as some of the overall shortcomings of the existing landscape of data collection and access, and some of the strategies that you have found useful for being able to capture the full context and the full suite of information into one location for being able to actually ask useful questions of it?

I actually think this is one of the necessary evils of a data-driven organization. The more stakeholders you have that need data, and the more varied the questions you use the data to answer, the more inevitable it is that you will end up using multiple systems: each system for the relevant stakeholder, each system for the relevant data and for answering the relevant questions. And we're seeing that most customers end up duplicating the data between systems. And that's okay, because no one system is a silver bullet, and no one system can process all the data effectively and answer all the relevant questions.

But this is also making the entire process more challenging, because when you're sending the data out to multiple systems, it's becoming harder to know what data am I sending where, what are the compliance requirements and regulations for each of those data sources and for each of the data targets, and how do I keep my costs in check? Because every piece of data you collect is gonna cost you something, and every additional target is gonna cost you. And so while I think it's good, and we do need multiple targets and it's okay to duplicate the data between them, we also need better tools to manage the pipeline, to manage data extraction, and to manage the way we send the data out to the various targets.

And on that data extraction piece, I know that the platform that you're building out in Rookout is one solution to that, as far as making it so that you don't have to go through that full release cycle to add in new log messages or add in new metrics. But I'm wondering if you can talk a bit about how that works, particularly in terms of onboarding some of the non-technical stakeholders in the business, and the persistence or the overall workflow of, once you have defined an ad hoc metric or an ad hoc log line, how do you incorporate that into the next release of your platform?

So Rookout is a data extraction tool that allows you to instantly see into your app and collect new information. That information can be in the form of log lines. It can be in the form of metrics. It can even be in the form of entire snapshots of the application state.

And Rookout is focusing on a few elements in that area. The first is agility: getting you the data you need when you need it, on the fly. And the second is reducing the skill set. It's about reducing the risk factor and allowing people who are not as skilled, who are not as familiar with the code, who might not even be familiar with the language, to specify what they need, whether that's specifying a line of code or specifying a term they're interested in looking for, and get the data based on that query. And Rookout even allows you to pipeline the data to the final target, whether it's Datadog or Graphite or Elasticsearch or any of those other tools, in any of the formats. And you can even get a single data point and duplicate it to multiple targets based on your needs. And any one of those elements, what you collect, what you do with the data, how you transform it, and how you send it, all of it can be changed with simple configuration in a matter of seconds, so that you have full control and full visibility into what you're collecting and what you're doing with it.

For the pipeline that you're building out, what are some of the tools that you're using for being able to enable that data collection and pipelining and manage it?

Rookout is built on multiple services.

The first resides within the customer's application and is essentially an SDK that allows you to extract the data on the fly from the relevant portions of the app. The second service is an ETL component written in Golang for efficiency, and this second service is in charge of the ETL process. It takes the raw data, redacts it based on security policies, transforms it to the relevant target format, whether it's JSON or XML or just a string, and then sends it out to the final target in the most efficient and simple way possible. And the entire process is orchestrated from a single pane of glass, the Rookout SaaS service, and this allows you to keep everything in check and implement various organizational and operational policies on the process end to end.

For the full life cycle of the data collection piece, once you have defined something in the Rookout panel, is it then something that needs to be incorporated back into the code by a developer to ensure that it's included going forward? Or how do you manage the overall life cycle of these collection points? And what's the interaction between the stakeholders and the engineers for defining what the useful context and what the useful lifespan is of that collection point? Is it something that's generally a one-off, where somebody is just doing some quick sampling of the data? Or is it often something where they discover an additional data point that they need, and then they want to collect that further into the future for the full duration of the application?
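As an aside on the ETL component described in the previous answer, the redact, transform, and send stages might be sketched roughly as follows. This is illustrative Python only; the speaker says the real component is written in Go, and nothing here reflects its actual code. The field names in `REDACTED_FIELDS`, the `run_pipeline` function, and the list-based stand-in targets are all assumptions made for the example.

```python
import json
from xml.etree.ElementTree import Element, SubElement, tostring

# Hypothetical security policy: fields whose values must never leave the process.
REDACTED_FIELDS = {"password", "credit_card"}


def redact(record):
    """Return a copy of the record with sensitive values masked."""
    return {k: ("[REDACTED]" if k in REDACTED_FIELDS else v) for k, v in record.items()}


def to_json(record):
    return json.dumps(record, sort_keys=True)


def to_xml(record):
    # Assumes keys are valid XML element names, which holds for this toy data.
    root = Element("event")
    for key in sorted(record):
        SubElement(root, key).text = str(record[key])
    return tostring(root, encoding="unicode")


def to_string(record):
    return " ".join(f"{k}={record[k]}" for k in sorted(record))


FORMATTERS = {"json": to_json, "xml": to_xml, "string": to_string}


def run_pipeline(record, targets):
    """Redact once, then format and deliver the same data point to every target."""
    clean = redact(record)
    for fmt, send in targets:
        send(FORMATTERS[fmt](clean))


# Stand-in targets: plain lists instead of real Datadog or Elasticsearch clients.
log_store, metrics_store = [], []
run_pipeline(
    {"user": "alice", "password": "hunter2", "event": "login"},
    [("json", log_store.append), ("string", metrics_store.append)],
)
print(log_store[0])      # {"event": "login", "password": "[REDACTED]", "user": "alice"}
print(metrics_store[0])  # event=login password=[REDACTED] user=alice
```

Redacting before any formatting means no target-specific formatter ever sees the sensitive values, which is one simple way to apply a single policy across every destination.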

That's a matter of taste. We're seeing various customers handling it differently. On the one hand, there are various technical considerations. But even more importantly, I think it's a matter of personal preference. We're seeing some of our customers that prefer everything to be in Git. Everything should be part of the source code. Everything should be versioned. Everything should reside in the same place. And those customers tend to set temporary breakpoints with Rookout, tend to iterate on them, tend to figure out exactly what data they need. And then once they've gone through the discovery process and they know exactly what they need, they just implement it in the code in a single instance. On the other hand, we are seeing other customers that are very happy with having those configurable, policy-based tools collecting the data they need. And for them, those breakpoints can be around for many months, collecting the data with no detrimental effect on performance or anything else.

And then as the application continues to change, what are some of the risks of those breakpoints in Rookout being no longer valid as the particular source changes? Or if they set a breakpoint on a particular line of code and then that line is modified in some way or removed by the natural life cycle of the application, how does that impact the overall ability for Rookout to be able to collect that information?

Rookout has multiple mechanisms in place to identify source changes. And so we identify if this file has changed, if this line has changed. And I can say, for the most part, we're seeing that breakpoints are very, very stable. Once a breakpoint has been set, more often than not, it can stay up for months without any interruption.
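The conversation does not say how those mechanisms work. One hypothetical way to identify "if this file has changed, if this line has changed" is to store content hashes alongside the line number when the collection point is created, as in the sketch below. The `Breakpoint` record and the `set_breakpoint` and `resolve` functions are invented for illustration and should not be read as Rookout's implementation.

```python
import hashlib
from dataclasses import dataclass


def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class Breakpoint:
    line_no: int    # 1-based line number when the breakpoint was set
    line_hash: str  # hash of that line's stripped text
    file_hash: str  # hash of the whole file at that time


def set_breakpoint(source, line_no):
    lines = source.splitlines()
    return Breakpoint(line_no, digest(lines[line_no - 1].strip()), digest(source))


def resolve(bp, source):
    """Return the current line number for bp, or None if the line is gone."""
    if digest(source) == bp.file_hash:
        return bp.line_no  # file unchanged, nothing to do
    lines = source.splitlines()
    matches = [i + 1 for i, text in enumerate(lines) if digest(text.strip()) == bp.line_hash]
    if not matches:
        return None  # line was edited or deleted, so a human has to decide
    # Several identical lines may exist; pick the one nearest the old position.
    return min(matches, key=lambda n: abs(n - bp.line_no))


old = "def login(user):\n    check(user)\n    return True\n"
bp = set_breakpoint(old, 2)

moved = "import audit\n\n" + old  # two lines inserted above the breakpoint
deleted = old.replace("    check(user)\n", "")

print(resolve(bp, old), resolve(bp, moved), resolve(bp, deleted))  # prints "2 4 None"
```

Hashing the stripped line text makes the match tolerant of indentation changes, at the cost of treating any other edit to the line as a deletion.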

But once an interruption does occur, if it's a minor interruption, the system can cope with it independently. Say you've added a couple of new lines and the line number has moved from 72 to 76; then we're gonna identify that and fix it on the fly for you. But let's say the line you've added the breakpoint on was deleted. In that case, the system is gonna detect it and alert you that the breakpoint is no longer valid, and that you'll have to update it based on your needs.

For your own use cases, what are some of the ways that you have used Rookout to be able to add useful information into your data collection for solving some of the business needs that you've faced and feed back into the overall product life cycle?

So early in the Rookout life cycle, we were just starting out with the Rookout platform in general and our self-service capability specifically. And one of the things our sales engineers wanted to know was how often people sign up, and how they use the system once they sign up. And so instead of having our engineers spend a sprint on that analytics work, they did it themselves. So one of our solution engineers logged into Rookout. We have a dogfood environment, essentially Rookout on Rookout. And then he set Rookout breakpoints on a few of the relevant functions, such as login, sign-up, setting a breakpoint, getting data.

And so he could monitor those events and even route them to Slack, so that he essentially got Slack notifications whenever interesting things occurred in the system. And all of that without involving the engineering team or requiring any resources on their end.

Another interesting element of the overall collection of this data is the life cycle after it's been collected, where for certain operational stores you might only want to keep information around for the duration of, you know, days or weeks, possibly up to a certain number of months, whereas if you're trying to do more long-range analytics, you'll often want to be able to collect information over the course of months or years for doing things like seasonal analysis and being able to view trends in the data. And from your experience, do you tend to replicate the operations information into a more long-term analytics store or something like S3 to be able to handle some of these more long-term analytics beyond the useful life cycle in the operations context? And what are some of the useful strategies that you've found on that front?

We're seeing that most of our customers tend to have fairly complex analytics pipelines.

This is probably due to the various business requirements coming up gradually and solutions being implemented for them. We're seeing a lot of Elasticsearch-as-a-service solutions, such as Logz.io. We're seeing a lot of Kafka clusters being used, especially where complex data processing is required. And then we're seeing S3 buckets for archiving purposes. And the nice thing about Rookout is that whatever your needs are, the data can be sent there. It can even be sent to multiple destinations at the same time. And while we try to help our customers navigate that area, for the most part, they tend to know what they are doing with the data once they've got it.

And in your experience of building out this platform for being able to do more ad hoc data collection and bring more people into the process of defining these collection points, what are some of the more interesting or unexpected or challenging lessons that you've learned in the process?

I think what surprised me the most was that often engineers have little understanding of how their code is running in production. I mean, as engineers, especially in the cloud native era, we're often being told about stateless services and why they matter so much. But today, with many organizations still having some level of divide between development and ops, engineers often have little understanding of how their code is deployed, how many instances there are, where they are deployed across regions, and so on. And we found that that hurts engineers' ownership of their code, because they are missing a big piece of the puzzle of how this code is actually running, and so it's detaching them from the day-to-day meaning of it. And we found that Rookout is making a huge difference there, because it helps the engineers to see how their code is deployed, to see how their code is running, and it's creating new and interesting conversations within organizations between engineering and operations once there is better visibility and a more common language.

And the operations data is definitely something that can be fairly clearly viewed as something that's valuable from the business context because oftentimes it's being collected from the software that is either the entire source of income for the business or something that's critical to the support of the overall

business operations. But what are some of the other sources of dark data that you have found to be useful and that as engineers and as business stakeholders in an organization, we should be thinking about or keeping an eye out for?

I believe we're in many ways underestimating the valuable data residing in our software. As you've mentioned, software is becoming more central to the business process. There is an even more powerful truth, and that is that a class of traditional IT metrics is going away. We have to move to the cloud, we have to move to infrastructure as a service. Even if you're on a private cloud, individual servers don't matter as much as they used to. Nobody cares how much free space there is on a hard drive or what the CPU utilization is on a single node. As we are abstracting away the hardware and often the operating system, software is becoming what matters. Take a look at serverless, for instance. There is nothing but software as far as we care. And so there is never-ending value, and there's a never-ending source of interesting information in our software, whether it's for designing new features, fixing bugs, or monitoring the business performance, and it's all there, ready for the taking, if we just go ahead and get it.

And are there any other aspects of the process of collecting these metrics and information from the software that we're running, or the value that can be obtained from the information that's hiding in those systems, or the overall process of leveraging dark data in an organization that we didn't discuss yet that you'd like to cover before we close out the show, or any other aspects of the work that you're doing at Rookout?

Yeah. So actually, another interesting thing we're seeing with customers is that once data does go into the analytics system, it's quite often problematic. Sometimes it's just noise. It's bad data that in fact needs to be removed.

Sometimes it might be useful data, but they have no idea where the data is coming from or what it means. And we are often seeing that governing the data you already have is a huge problem all by itself: how do I gain an understanding of what data is in there? Why is it coming? How do I clean it up? And those challenges are sometimes even harder than getting new data into the system. And again, we believe that a single platform, a single pipeline with full end-to-end governance is critical for meeting those challenges.

Yeah. Data governance and data lineage management is definitely one of the continuing challenges that's becoming ever more relevant as we add new and more varied sources of data and more complex analytics on top of it.

Definitely.

And are there any other aspects of this conversation that we didn't go deep enough into that you'd like to talk about some more, or do you think we did a pretty good job of covering it?

I think we did a pretty good job.

Alright. So for anybody who wants to follow along with the work that you're doing, Liran, or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

I think the biggest gap is that the way we extract data is embedded into our source code. And as much as we've come a very long way over the last decade or so, and we are delivering software at an ever increasing pace, we can only deliver software so fast. We can only change software so fast. And updating an entire server, sometimes an entire fleet of servers, just to flip a single byte in memory so that you can get a new metric, that's a huge waste of time, effort, and compute power. We need a way to extract the data we need without going through this terribly expensive process. Because as we become more and more data driven and data becomes even more valuable, it's critical for us to be able to cheaply and easily get the data we need to do our jobs and grow our business.

Alright. Well, thank you very much for taking the time today to join me and share your perspective and experience on being able to separate out the data collection from the source code deployment and all the work that you're doing at Rookout. It's definitely a very interesting platform and I look forward to seeing what you do there. So thank you for all of that and I hope you enjoy the rest of your day.

Thank you. You too.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
