Making Data Collection In Your Code Easy With Rookout - Episode 128
April 14, 2020
The software applications that we build for our businesses are a rich source of data, but accessing and extracting that data is often a slow and error-prone process. Rookout has built a platform to separate the data collection process from the lifecycle of your code. In this episode, CTO Liran Haimovitch discusses the benefits of shortening the iteration cycle and bringing non-engineers into the process of identifying useful data. This was a great conversation about the importance of democratizing the work of data collection.
Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at dataengineeringpodcast.com/linode or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Your host is Tobias Macey and today I’m interviewing Liran Haimovitch, CTO of Rookout, about the business value of operations metrics and other dark data in your organization
How did you get involved in the area of data management?
Can you start by describing the types of data that we typically collect for the systems operations context?
What are some of the business questions that can be answered from these data sources?
What are some of the considerations that developers and operations engineers need to be aware of when they are defining the collection points for system metrics and log messages?
What are some effective strategies that you have found for including business stakeholders in the process of defining these collection points?
One of the difficulties in building useful analyses from any source of data is maintaining the appropriate context. What are some of the necessary metadata that should be maintained along with operational metrics?
What are some of the shortcomings in the systems we design and use for operational data stores in terms of making the collected data useful for other purposes?
How does the existing tooling need to be changed or augmented to simplify the collaboration between engineers and stakeholders for defining and collecting the needed information?
The types of systems that we use for collecting and analyzing operations metrics are often designed and optimized for different access patterns and data formats than those used for analytical and exploratory purposes. What are your thoughts on how to incorporate the collected metrics with behavioral data?
What are some of the other sources of dark data that we should keep an eye out for in our organizations?
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Liran Haimovitch, CTO of Rookout, about the business value of operations metrics and other dark data in your organization. So Liran, can you start by introducing yourself?
Hey, Tobias, it's great to be here. I'm Liran Haimovitch. I'm Rookout's co-founder and CTO. I've spent most of my career doing cybersecurity, and for the last three and a half years I've been the co-founder and CTO of Rookout, where we deal with data extraction, data fetching, and in general, how we can easily get the data we need on the fly.
About three and a half years ago, my co-founder and I wanted to found a startup, something in the area of DevOps and dev tooling. And as we dug deeper into it, we were fascinated by how hard it was to collect data from our own software. I mean, it's our code, it's running on our servers, and yet it could be ridiculously hard to get a piece of data from it: a stack trace, a variable value. And as we dug deeper into it, with our cybersecurity experience, we came up with a novel way to extract the data on the fly without going through the traditional process of embedding the data extraction code into our application code.
And so for the types of data that we're talking about, obviously, in your application, some of the types of things you're looking for are log data or system metrics. But what are some of the other ways that that data manifests, and the types of information that you might be looking to extract?
So that data manifests in a very, very broad spectrum. As you mentioned, logs and metrics tend to be somewhere mid-spectrum. They are technical, but not super deep. On the one end of the spectrum we're seeing business metrics, which are logins, signups, revenue generated, and so on, and we're seeing those metrics growing in importance. As software in general is becoming a bigger part of the business, so are the metrics required to measure it. And on the other side of the spectrum, you'll see very, very technical data being extracted, heap dumps, stack traces, and so on, that allow you to truly see the exact state of the application so that you can better understand it.
And obviously, for things like the amount of revenue generated, that's something that people have tracked for a while, usually by dumping the actual sales reports at the end of the quarter, for instance. So then, as far as the types of business questions that can be answered from some of these different types of data that you're collecting, what are some of the most interesting or most pressing ones that you've seen or that some of your customers have been able to ask and answer?
So we're seeing that with the rise of software in digitalization processes, software engagement is becoming ever more important, and the various business stakeholders, whether it's sales, marketing, customer success, or product, are becoming keener and keener to know exactly what's going on: to know how much money is being made, which part of the product is making that money, and which part of the product is creating engagement. And as new features are being rolled out and new market segments are being penetrated, those questions are ever evolving. As software processes are becoming bigger parts of the business, we're seeing the business stakeholders, sales, marketing, and others, ever more interested in answering questions about how the software is behaving with real-world customers. Those questions tend to be the same and fall into a few broad categories. How many people are signing up and logging into my software? How do people interact with my software? Which parts of the software do they interact with, and how often? But the interesting thing is, how often do those questions change? And what we're seeing is that as time passes, business stakeholders have to answer more questions, and they keep changing them more frequently. And that's a big difference. You're not just answering the question of how much revenue am I generating per quarter. Every day, every week, you have new stakeholders asking new questions, and you have to iterate on the data you're collecting to answer those questions.
And for being able to collect that additional data, as you mentioned, some of the traditional means that developer teams and operations engineers might go about it is to go through the code, build, and release cycle, where they need to add new log lines or explicitly add new points where they're collecting specific metrics in the application. So what are some of the considerations that developers and operations engineers need to be aware of when they are defining those collection points for the types of metrics and log messages that they're collecting? And what do you see as some of the effective strategies for being able to incorporate the different business stakeholders in that process of defining those collection points?
So that's exactly right. The traditional method for extracting data is by writing more instrumentation code to get the data, and there are a bunch of considerations here. The basic, more engineering-oriented questions have to do with: how do I make sure it doesn't impact my performance? How do I make sure it doesn't impact my reliability? How do I make sure it doesn't change the program logic, that I'm not accessing a variable that looks like an attribute but is in fact a function? And with those elements, you just have to use traditional techniques: unit testing, QA testing, and so on. But at the same time, there's a whole slew of other considerations that have to do with security, data governance, and compliance, as well as how do I route the data to the right stakeholder in the right format. And those tend to be tricky requirements to meet, which require a lot of documentation of what is allowed and isn't allowed, and implementing that into your everyday engineering processes.
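The hazard mentioned above, reading something that looks like an attribute but is in fact a function, can be illustrated with a short Python sketch. This is only an illustration under assumed names (`Cart` and `safe_snapshot` are hypothetical and are not part of Rookout or any product discussed in the episode); it relies on the standard library's `inspect.getattr_static`, which looks a name up without running property code.

```python
import inspect


class Cart:
    """Toy class whose `total` looks like a plain attribute but runs code."""

    def __init__(self, prices):
        self.prices = prices
        self.total_reads = 0  # observable side effect of reading `total`

    @property
    def total(self):
        # Reading this "attribute" mutates state, which is the hazard.
        self.total_reads += 1
        return sum(self.prices)


def safe_snapshot(obj):
    """Collect public fields without triggering properties or methods."""
    snapshot = {}
    for name in dir(obj):
        if name.startswith("_"):
            continue
        # getattr_static retrieves the raw member without invoking descriptors.
        raw = inspect.getattr_static(obj, name)
        if isinstance(raw, property) or callable(raw):
            snapshot[name] = "<skipped: computed>"
        else:
            snapshot[name] = raw
    return snapshot


cart = Cart([5, 7])
snap = safe_snapshot(cart)
```

A naive collector that called `getattr(cart, "total")` would increment `total_reads` every time it sampled the object, changing program state as a side effect of observation; the static lookup avoids that at the cost of not capturing computed values.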
And then another aspect of the type of data that you're collecting is ensuring that you have the appropriate context, because a single data point isn't necessarily going to be very meaningful unless you have some of the broader context. A single login session might be useful in some cases, but you want to know: has this person ever logged in before? What are they trying to achieve? And so what have you found to be some of the ways of being able to capture some of that context and define it, and then, once it is captured, how to co-locate that information with the data after it's been collected and as it traverses different systems?
That's exactly right. One of the challenges we're seeing customers struggle with, and often the reason why they have to iterate over and over on data collection, is that you have to get the whole piece of data. It's not enough to get a variable, it's not enough to get a metric. You have to be able to associate it with the relevant user segment, maybe with the relevant server, based on what you're going to do with the data. And then once you have the data, you have to make sure that you bring it as a whole to rest in the right system, side by side with other pieces of data, so that you can correlate it. And if you end up forgetting a variable or sending it to the wrong system, more often than not you're going to have to do another iteration just to get it right.
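One way to picture "getting the whole piece of data" is to attach the context at the moment of collection, so the value and its metadata travel together as a single record. The sketch below is a minimal illustration; the `DataPoint` type, the `collect` helper, and the field names are assumptions made for this example and do not come from the episode.

```python
import socket
import time
from dataclasses import asdict, dataclass, field


@dataclass
class DataPoint:
    """A collected value together with the context needed to interpret it."""

    name: str
    value: float
    context: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


def collect(name, value, *, user_segment, host=None):
    """Attach user segment and host at collection time, not afterwards."""
    context = {
        "user_segment": user_segment,
        "host": host or socket.gethostname(),
    }
    return DataPoint(name=name, value=value, context=context)


point = collect("login_latency_ms", 182.0, user_segment="trial", host="web-01")
record = asdict(point)  # one self-describing record, ready to ship as a whole
```

Because the segment and host are required at the call site, a forgotten piece of context surfaces as an error during development rather than as another collection iteration later.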
Another element of the collection of this operations data is that often we're using multiple different systems for collecting that information. So for instance, with stack traces, you might use something like Sentry or Rollbar for being able to capture that information, log data might end up going into some hosted log provider or your own Elasticsearch cluster, and metrics might be going into something like Graphite, or you might use Datadog. And then being able to correlate all that information together is either very difficult or potentially impossible. And I'm wondering what you see as some of the overall shortcomings of the existing landscape of data collection and access, and some of the strategies that you've found useful for being able to capture the full context and the full suite of information into one location for being able to actually ask useful questions.
I actually think this is one of the necessary evils of a data-driven organization. The more stakeholders you have that need data, and the more varied the questions you use the data to answer, the more inevitable it is that you will end up using multiple systems: each system for the relevant stakeholder, each system with the relevant data and for answering the relevant questions. And we're seeing that most customers end up duplicating the data between systems. And that's okay, because no one system is a silver bullet. No one system can process all the data effectively and answer all the relevant questions. But this is also making the entire process more challenging, because when you're sending the data out to multiple systems, it's becoming harder to know: what data am I sending where? What are the compliance requirements and regulations for each of those data sources and for each of the data targets? And how do I keep my costs in check? Because every piece of data you collect is going to cost you something, and every additional target is going to cost you. And so while I think it's good, and we do need multiple targets, and it's okay to duplicate the data for them, we also need better tools to manage the pipeline, to manage data extraction, and to manage the way we send the data out to the various targets.
And on that data extraction piece, I know that the platform that you're building out at Rookout is one solution to that, as far as making it so that you don't have to go through that full release cycle to add new log messages or new metrics. But I'm wondering if you can talk a bit about how that works, particularly in terms of onboarding some of the non-technical stakeholders in the business, and the persistence or the overall workflow: once you have defined an ad hoc metric or an ad hoc log line, how do you incorporate that into the next release of your platform?
So Rookout is a data extraction tool that allows you to instantly set up and collect new information. That information can be in the form of log lines, it can be in the form of metrics, it can even be in the form of entire snapshots of the application state. And Rookout is focusing on a few elements in that area. The first is agility: getting you the data you need, when you need it, on the fly. And the second is reducing the skill set required. It's about reducing the risk factor and allowing people who are not as skilled or not as familiar with the code, who might not even be familiar with the language, to specify what they need, whether they specify a line of code or a term they're interested in looking for, and get the data based on that query. And Rookout even allows you to pipeline the data to the final target, whether it's Datadog, or Graphite, or Elasticsearch, or any of those other tools, in any of the formats. And you can even get a single data point and duplicate it to multiple targets based on your needs. And any one of those elements, what you collect, what you do with the data, how you transform it, and where you send it, all of it can be changed with a simple configuration in a matter of seconds, so that you have full control and full visibility into what you're collecting and what you're doing with it.
So Rookout is built on multiple services. The first resides in the customer's application and is essentially an SDK that allows you to extract the data on the fly from the relevant portions of the app. The second service is an ETL component, written in Golang for efficiency, and this second service is in charge of the ETL process. It takes the raw data, redacts it based on security policies, transforms it to the relevant target format, whether it's JSON or XML or just a string, and then sends it out to the final target in the most efficient and simple way possible. And the entire process is orchestrated from a single pane of glass, the Rookout service. And this allows you to keep everything in check and implement various organizational and operational policies on the process end to end.
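The redact, transform, and send stages described above, together with duplicating one data point to several targets, can be sketched in a few lines of Python. This is an illustrative sketch only and does not show Rookout's actual implementation (which, per the conversation, is written in Go); the field names in `REDACTED_FIELDS`, the formatters, and the `ListTarget` stand-in are all assumptions made for the example.

```python
import json

# Hypothetical policy: field names that must never leave the process.
REDACTED_FIELDS = {"password", "credit_card"}


def redact(record):
    """Replace sensitive values according to the (assumed) security policy."""
    return {
        key: ("[REDACTED]" if key in REDACTED_FIELDS else value)
        for key, value in record.items()
    }


def to_json(record):
    """Format a record as JSON with stable key order."""
    return json.dumps(record, sort_keys=True)


def to_text(record):
    """Format a record as a plain key=value string."""
    return " ".join(f"{key}={record[key]}" for key in sorted(record))


class ListTarget:
    """Stand-in for a real destination such as a log store or metrics service."""

    def __init__(self, formatter):
        self.formatter = formatter
        self.received = []

    def send(self, record):
        self.received.append(self.formatter(record))


def run_pipeline(record, targets):
    """Redact once, then format and deliver a copy to every configured target."""
    clean = redact(record)
    for target in targets:
        target.send(clean)


json_target = ListTarget(to_json)
text_target = ListTarget(to_text)
run_pipeline({"user": "alice", "password": "hunter2"}, [json_target, text_target])
```

Redacting before the fan-out means every target receives the same policy-checked record, so adding a destination cannot accidentally bypass the security policy.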
For the full lifecycle of the data collection piece, once you have defined something in the Rookout panel, is it then something that needs to be incorporated back into the code by a developer to ensure that it's included going forward? Or how do you manage the overall lifecycle of these collection points? And what's the interaction between the stakeholders and the engineers for defining what the useful context and the useful lifespan is of that collection point? Is it something that's generally a one-off, where somebody is just doing some quick sampling of the data? Or is it often something where they discover an additional data point that they need, and then they want to keep collecting that into the future for the full duration of the application?
That's a matter of taste. We're seeing various customers handling it differently. On the one hand, there are various technical considerations. But even more importantly, I think it's a matter of personal preference. We're seeing some of our customers that prefer everything to be part of the source code: everything should be versioned, everything should reside in the same place. And those customers tend to set temporary breakpoints in Rookout, iterate on them, and figure out exactly what data is needed. And then once they've gone through the discovery process and they know exactly what they need, they just implement it into the code in a single instance. On the other hand, we are seeing other customers that are very happy with having those configurable, policy-based tools collecting the data they need, and for them, those breakpoints can be around for many months, collecting the data with no detrimental effect on performance or anything else.
And then as the application continues to change, what are some of the risks of those breakpoints in Rookout being no longer valid as the particular source changes? Say they set a breakpoint on a particular line of code, and then that line is modified in some way or removed by the natural life cycle of the application. How does that impact the overall ability for Rookout to be able to collect that information?
Rookout has multiple mechanisms in place to identify source changes. And so we identify if this file has changed, if this line has changed, and I can say that for the most part, we're seeing that breakpoints are very, very stable. Once the breakpoint has been set, more often than not it can stay up for months without any interruption. But once an interruption does occur, if it's a minor interruption, the system can cope with it independently. Say you've added a few new lines and the line number has moved from 72 to 76: then we're going to identify that and fix it on the fly for you. But let's say the line you've added a breakpoint on is deleted. In that case, the system is going to detect it and alert you that the breakpoint is no longer valid, and that you have to update it based on your needs.
For your own use cases, what are some of the ways that you have used Rookout to be able to add useful information into your data collection for solving some of the business needs that you've faced and feed back into the overall product lifecycle?
So early in the Rookout lifecycle, we were just starting out with the Rookout platform in general and our self-service capability specifically, and one of the things our sales engineers wanted to know was: how often do people sign up? And how do they use the system once they sign up? And so instead of having our engineers spend a sprint on that analytics work, they did it themselves. So one of our solution engineers logged in to Rookout, we have a dogfood environment, essentially Rookout on Rookout, and then he set various Rookout breakpoints on a few of the relevant functions, such as login, sign up, setting a breakpoint, and getting data. And so we could monitor those events and even route them to Slack, so that he essentially got Slack notifications whenever interesting things occurred in the system. And all of that without involving the engineering team or requiring any resources on their end.
Another interesting element of the overall collection of this data is the lifecycle after it's been collected. In certain operational stores you might only want to keep information around for the duration of days or weeks, possibly up to a certain number of months, whereas if you're trying to do more long-range analytics, you'll often want to be able to collect information over the course of months and years for doing things like seasonal analysis and being able to view trends in the data. And from your experience, do you tend to replicate the operations information into a more long-term analytics store, or something like S3, to be able to handle some of these more long-term analytics beyond the useful life cycle in the operations context? And what are some of the useful strategies that you've found on that front?
We're seeing that most of our customers tend to have fairly complex analytics pipelines. This is probably due to the various business requirements coming up gradually and those solutions being implemented. We're seeing a lot of Elasticsearch-as-a-service solutions such as Logz.io, and we're seeing a lot of Kafka clusters being used, especially where complex data processing is required. And often we're seeing S3 buckets for archiving purposes. And the nice thing about Rookout is that whatever your needs are, the data can be sent there. It can even be sent to multiple destinations at the same time. And while we try to help our customers navigate that area, for the most part they tend to know what they're doing with the data once they've got it.
And in your experience of building out this platform for being able to do more ad hoc data collection and bring more people into the process of defining these collection points, what are some of the more interesting or unexpected or challenging lessons that you've learned in the process?
I think what surprised me the most was that often engineers have little understanding of how the code is running in production. I mean, as engineers, especially in the cloud native era, we're often being told of stateless services and why they matter so much. But today, with many organizations still having some level of divide between development and ops, engineers often have little understanding of how the code is deployed, how many instances are out there, where they're deployed, whether they're deployed across regions, and so on. And we found that that hurts engineers' ownership of their code, because they're missing a big piece of the puzzle of how this code is actually running in production. So it's detaching them from the day-to-day meaning of it. And we found that Rookout is making a huge difference there, because it helps the engineers to see how the code is deployed and to see how the code is running. And it's creating new and interesting conversations within organizations between engineering and operations once there is better visibility and a more common language.
And the operations data is definitely something that can be fairly clearly viewed as valuable from the business context, because oftentimes it's being collected from the software that is either the entire source of income for the business or something that's critical to the support of the overall business operations. But what are some of the other sources of dark data that you have found to be useful, and that as engineers and as business stakeholders in an organization we should be thinking about or keeping an eye out for?
I believe we are, in many ways, underestimating the valuable data residing in our software. As you've mentioned, software is becoming more central to the business process, but there is an even more powerful truth, and that is that classic, traditional IT metrics are going away with the move to the cloud, with the move to infrastructure as a service. Even if you're on a private cloud, individual servers don't matter as much as they used to. Nobody cares how much free space there is on the hard drive, or what the CPU utilization on a single node is. We're abstracting away the hardware and often the operating system. Software is becoming what matters. Take a look at serverless, for instance: there is nothing but software as far as we care. And so there is never-ending value, and a never-ending source of interesting information, in our software, whether it's for designing new features, fixing bugs, or monitoring business performance. And it's all there, ready for the taking, if we just go ahead and get it.
And are there any other aspects of the process of collecting these metrics and information from the software that we're running, or the value that can be obtained from the information that's hiding in those systems, or the overall process of leveraging dark data in an organization, that we didn't discuss yet that you'd like to cover before we close out the show, or any other aspects of the work that you're doing at Rookout?
Yes. So another interesting thing we're seeing with customers is that once data does go into the analytics system, quite often it's problematic. Sometimes it's just noise: it's bad data that in fact needs to be removed. Sometimes it might be useful data, but they have no idea where the data is coming from or what it means. And we're often seeing that governing the data you already have is a huge problem all by itself: how do I gain an understanding of what data is in there? Why is it coming in? How do I clean it up? And those challenges are sometimes even harder than getting new data into the system. And again, we believe that a single platform, a single pipeline with full end-to-end governance, is critical for meeting those challenges.
Yeah, data governance and data lineage management is definitely one of the continuing challenges that's becoming ever more relevant as we add new and more varied sources of data and more complex analytics on top of it. Definitely. And are there any other aspects of this conversation that we didn't go deep enough into that you'd like to talk about some more? Or do you think we did a pretty good job of covering it?
All right. So for anybody who wants to follow along with the work that you're doing, Liran, or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
I think the biggest gap is that the way we extract data is embedded into our source code. And as much as we've come a very long way over the last decade or so, and we are delivering software at an ever increasing rate, we can only deliver software so fast, we can only change software so fast. And updating an entire server, sometimes even updating an entire fleet of servers, just to flip a single byte in memory so that you can get a new metric, that's a huge waste of time, effort, and compute power. And we need a better way. We need a way to extract the data we need without going through this terribly expensive process. Because as we become more and more data driven, and data becomes even more valuable, it's critical for us to be able to cheaply and easily get the data we need to do our jobs and grow the business.
All right, well, thank you very much for taking the time today to join me and share your perspective and experience on being able to separate out the data collection from the source code deployment, and all the work that you're doing at Rookout. It's a very interesting platform, and I look forward to seeing what you do there. So thank you for all of that, and I hope you enjoy the rest of your day.
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used, and visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.