DataOps For Streaming Systems With Lenses.io - Episode 140

Summary

There are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, including observability, discovery, and governance of your data. That’s what the Lenses.io DataOps platform is built for. In this episode CTO Andrew Stevenson discusses the challenges that arise from building decoupled systems, the benefits of using SQL as the common interface for your data, and the metrics that need to be tracked to keep the overall system healthy. Observability and governance of streaming data requires a different approach than batch oriented workflows, and this episode does an excellent job of outlining the complexities involved and how to address them.

Datadog is a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog delivers complete visibility into the performance of modern applications in one place through its fully unified platform—which improves cross-team collaboration, accelerates development cycles, and reduces operational and development costs.

Try Datadog in your environment with a free 14-day trial—and get a complimentary t-shirt if you install the agent. Go to datadog.com/dataengineeringpodcast to get started!

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
  • Your host is Tobias Macey and today I’m interviewing Andrew Stevenson about Lenses.io, a platform to provide real-time data operations for engineers

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Lenses is and the story behind it?
  • What is your working definition for what constitutes DataOps?
    • How does the Lenses platform support the cross-cutting concerns that arise when trying to bridge the different roles in an organization to deliver value with data?
      • What are the typical barriers to collaboration, and how does Lenses help with that?
  • Many different systems provide a SQL interface to streaming data on various substrates. What was your reason for building your own SQL engine and what is unique about it?
  • What are the main challenges that you see engineers facing when working with streaming systems?
  • What have you found to be the most notable evolutions in the community and ecosystem around Kafka and streaming platforms?
  • One of the interesting features in the recent release is support for topologies to map out the relations between different producers and consumers across a stream. Why is that a difficult problem and how have you approached it?
  • On the point of monitoring, what are the foundational challenges that engineers run into when trying to gain visibility into streams of data?
    • What are some useful strategies for collecting and analyzing traces of data flows?
  • As with many things in the space of data, local development and pre-production testing and validation are complicated due to the potential scale and variability of a production system. What advice do you have for engineers who are trying to establish a sustainable workflow for streaming applications?
    • How do you facilitate the CI/CD process for enabling a culture of testing and establishing confidence in the correct functionality of your systems?
  • How is the Lenses platform implemented and how has its design evolved since you first began working on it?
  • What are some of the specifics of Kafka that you have had to reconsider or redesign as you began adding support for additional streaming engines (e.g. Redis and Pulsar)?
  • What are some of the most interesting, unexpected, or innovative ways that you have seen the Lenses platform used?
  • What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on and with Lenses?
  • When is Lenses the wrong choice?
  • What do you have planned for the future of the platform?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard earned expertise. And when you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today I'm interviewing Andrew Stevenson about Lenses, a platform to provide real time data operations for engineers. So Andrew, can you start by introducing yourself?
Andrew Stevenson
0:01:27
Yeah, hello. Great that you have me. I'm Andrew. I'm the CTO of Lenses. I've been working in data now for a long time, 20 years. I started off as a C++ developer, and ended up being a big data specialist around Kafka, and now the CTO of Lenses.
Tobias Macey
0:01:46
And do you remember how you first got involved in the area of data management?
Andrew Stevenson
0:01:49
I think I was always involved in it. If I look back, even when I was doing civil engineering, I was still collecting data for a pressure management system, and even when I was doing C++ it was still matching and settling trades in a clearing house. So it's always been there. I think it really became data management when I was at a high frequency trading firm, and we were doing a lot of big data movement before it was called big data back then, using a lot of the Microsoft stack actually, providing a lot of real time analytics to the trading. So I think that's where I truly transitioned into a full blown data engineer, but I think it's always been there. For me, it's always been a constant. You know, you talk about data being the protagonist; it was always present in everything I was doing, even when I was more of a traditional developer.
Tobias Macey
0:02:44
And so in terms of lenses itself, you mentioned that you happened upon working with Kafka in the streaming space. So I'm wondering if you can give a bit of a background about what the lenses product is and some of the story behind how it got started.
Andrew Stevenson
0:02:57
Yeah, so I think people and companies are starting to realize they want to get value from their data, more and more in real time. When I was working as a data contractor at various companies, normally in the investment banking scene in London, I came across Antonios, the CEO, and we were trying to implement many projects on top of Kafka and Spark and all this, and actually seeing the difficulty that we had there. So we thought, okay, we can address this. What we're trying to do with Lenses is take the pain and the cost and the complexity out of doing that, so you can actually build real time data driven applications more easily. And an important aspect of that is actually bringing the business back into these data projects. I think the business got sidelined too much, and all the focus was on the technology. And where I've seen the success was always by being able to bring the business context to any data that we're using. For example, I'm a technologist, I'm not an expert in market risk, but I saw these people getting sidelined. So we asked, how can we bring the tooling to be able to get these people back involved so we can make projects a success? Open source is great, but if we can't make it a success, and we can't get the tooling around it and bring the business context to it, that's where I've seen the failure. So that's what we ended up building Lenses out of. We first started building a lot of open source connectors, so we have the Stream Reactor, which is an open source project; it started as 30-odd Kafka connectors. And from that we went on to actually turn it into a real product, again with the focus really at the application layer and the data layer, making use of that. So for example, we have one client, Babylon Health, a unicorn health provider out of the UK, and their goal is to provide affordable health care for everyone on the planet. And they use tech intensity to do this, commodity software, so they can actually get back to what's important, right? There's nothing more important than your health, and in the current climate, you know, that's very important. So I think that's how we ended up here, and that's the key focus of what we're trying to do. One of the key things we have actually is our own SQL engine. Now having our own SQL engine to browse data and to process data also means that we can bridge this divide, actually, and bring all that amazing business context to the problems we want to solve.
Tobias Macey
0:05:23
And in the tagline for your product you call out DataOps as a keyword there. And I know that that's a term that's getting used more frequently now with varying degrees of meaning. So you mentioned being able to bring the business into the context of the data and ensure collaboration across the different team boundaries. But can you give your working definition of what you think constitutes DataOps, and some of the ways that the Lenses platform helps to support the cross-cutting concerns that come up when trying to bridge the gap between those different roles in an organization for being able to deliver that value from the data?
Andrew Stevenson
0:05:59
Yes. So I think I tend to agree with your definition of DataOps there. The important bit is, we've had a lot of movement around DevOps, and it has been very successful at making sure that we apply operational principles to get software delivered quickly; we also now need to make sure that we're applying that at the data level. So how we manage access to the data, as well as how we actually provide the governance and the data ethics around what we're doing at the data layer, plus combining everything from the DevOps perspective; we want to take all those good parts. So what we do in Lenses is we make sure that not only do we have the visibility aspects from the SQL engine, we also have a very strong role based security model, so we apply the governance as well. But we also make sure we incorporate all the good points from the DevOps side of it, such as GitOps; everything in Lenses is an API. So we can version control all the attributes that go into making a data platform successful. So we can move topics, we can move processors, we can move connectors, everything we need to get into production quickly, while still providing the visibility and the governance around the system.
Tobias Macey
0:07:15
You mentioned the fact that you have your own custom SQL engine. I'm wondering what the motivation was for building that out specifically for this product versus using some of the off the shelf engines that are available for some of these streaming platforms, or leveraging some of the components that exist out there, such as the Calcite project in the Apache ecosystem.
Andrew Stevenson
0:07:38
So when we started out, the first actual bit of SQL we did was a thing called Kafka Connect Query Language, KCQL. This is arguably the first SQL layer that was there for Kafka, and this is what we introduced into the connectors. We looked at Calcite at the time, but we didn't think we needed it for what we were trying to achieve in the connectors. We did look at Calcite again for the current SQL engine that we have; however, we weren't able to extend the grammar for where we wanted to take the engine, so we decided to build our own. Now we built it on top of Kafka Streams, so effectively our SQL engine boils down to a Kafka Streams application at the moment. And we chose that route because of, you know, some of the advantages we were seeing, that it's just a lightweight library. That doesn't mean that we're not interested in bringing in other technology; we will look at things such as Flink or Spark, and we're looking at how we can incorporate them into Lenses, because we want to just take the best technologies out there. So Satya Nadella talks about this, it's tech intensity, using the commodity infrastructure and platforms that are out there now so we can get back to delivering business value. Sorry, I talk about tech intensity quite a lot now. So that was the reason why we went our own way with the SQL engine. And plus it allows us, if we want to pull in other technologies such as Pulsar, such as Redis Streams, for example, we can.
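To make the KCQL idea concrete, here is a minimal sketch of declaring a sink connector whose routing logic is a KCQL statement rather than code, submitted through the standard Kafka Connect REST API. The connector class and property names are illustrative placeholders rather than exact Stream Reactor settings, and the topic and table names are invented for the example:

    import json
    import requests

    # Hypothetical example: declare a sink connector whose behaviour is expressed
    # as a KCQL statement rather than code. The connector class and property
    # names below are illustrative placeholders, not exact Stream Reactor settings.
    connector = {
        "name": "orders-cassandra-sink",
        "config": {
            "connector.class": "com.example.CassandraSinkConnector",  # placeholder class
            "topics": "orders",
            # KCQL: route and shape the topic data declaratively.
            "connect.kcql": "INSERT INTO orders_table SELECT id, amount, currency FROM orders",
        },
    }

    # The Kafka Connect REST API accepts new connectors at POST /connectors.
    resp = requests.post(
        "http://localhost:8083/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector),
    )
    resp.raise_for_status()
    print(resp.json())

The point is that the transformation is captured declaratively in the KCQL string, which makes it easy to review, version control, and reason about.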
Tobias Macey
0:09:00
So in terms of the collaboration aspects of it, SQL has become the universal dialect for working with data and something that is sufficiently descriptive and approachable, that it's possible to get different business users on board with it to at least be able to view the query and generally understand what's happening there without having to give them their own computer science course on how to program their own definitions. And I'm wondering what you have found to be some of the ways that that has helped in breaking down those silos between the engineering and business teams, and just some of the overall effectiveness of using that as a means of communication about the intent and purpose of the analysis that's being done and the ways that the data is being processed.
Andrew Stevenson
0:09:48
So I think you're right that the key thing is giving access. SQL is the level playing field. There are many companies that I've worked with in the past that are very, very highly technical, and the engineers are capable of analyzing data without SQL, but the vast majority aren't, and SQL creates that leverage. What we see as the benefit of SQL, and it goes for both the business users and even the developers, is just the speed with which they can actually move into production, and the savings that they actually have. So we have Playtika; they have 600 developers, and one of their key things is actually just using SQL to be able to debug their messages, which saves them about 30 minutes to an hour each day across 600 developers. So it's a big saving there. And we have other examples: for example Riskfuel, they are a risk calculation startup, and what they use the SQL for is to actually feed their machine learning models. So it's helping bridge the divide for everybody, actually. It's enabling the developers to be more productive as well, and also then bridging the gap with more traditional businesses.
Tobias Macey
0:11:03
And then around the concept of streaming data and streaming workflows, what are some of the challenges that you see being common across engineering teams who are trying to build applications around this paradigm, and some of the ways that the Lenses platform helps to address some of those challenges?
Andrew Stevenson
0:11:22
So most of the reason why we actually built Lenses is a lot of the tooling. For example, I was on a call today with a large travel company, and their platform becomes a bottleneck because they're stuck at the command line. So actually providing the tooling so people can actually see their data and check if that data is good, and have a simple, controlled and governed way to actually deploy their applications, that's the key. You know, if I'm in a smaller company, I can develop a small application quite quickly, but when I want to go into a serious multinational company with all the compliance and security and auditing around that, that's where it really becomes difficult to use the DIY tooling. So with Lenses, because it's all API driven, you can automate everything. We give them the visibility, whether that's just looking for data or analyzing data, but also the capability to quickly and easily deploy those flows, the SQL flows, so they can get back to the maybe more interesting things, the machine learning models, rather than just wrangling data at the command line all the time.
Tobias Macey
0:12:31
And then another interesting element of this is just the overall evolution in terms of the capabilities of streaming systems and the overall adoption of them. And I'm wondering what you have found to be some of the most notable elements of that evolution or key sort of inflection points of when things started to take off or ways that streaming is being used.
Andrew Stevenson
0:12:54
So I personally have always been working in a streaming world, mainly because I've been working in finance. But where I saw it going more mainstream was as companies started adopting Hadoop clusters as well; they see the value of it, but there's always the need for more speed, effectively. And Kafka was in a good position, because I saw it installed in a lot of environments because of the Hadoop ecosystem. So it was a natural movement on, and especially from the financial point of view, they're used to this anyway, they're used to having real time data; what they're looking for then is scaling. And we also have a lot of IoT customers now, and that becomes a natural fit as well; they want to progress that data analytics platform.
Tobias Macey
0:13:39
And then in terms of the tooling that you're providing with Lenses, one of the pieces that I know is difficult in streaming contexts is being able to get visibility into the amount of data that's flowing through these pipelines, as well as some of the specifics of that information. So being able to do things like tracing, as you would do in a regular distributed system on the software side, but being able to understand how that data is flowing throughout your system and across the different components that are producing and consuming it. And I'm wondering what you have seen to be some of the useful metrics for tracking that, and some of the ways that you expose that information for engineers and business users to be able to gain some visibility and understanding of how things are progressing.
Andrew Stevenson
0:14:24
So typically, there's the standard JMX metrics that are very, very useful. However, that only gets you so far; that tells you how an application is performing. What we're actually seeing is the real value comes when you see how the applications fit together, I call it the application landscape. As a high level example, we have one customer, Vortexa. They used to manage Kafka themselves, and they had what's called Kafka Fridays, when Kafka would just fall down on them all the time. They've now moved to MSK, but from the Lenses perspective, one of the biggest things that they realized was that when they put Lenses on their infrastructure, they were able to see, effectively, what a mess they'd made. They had no concept of where and how applications were linked together; we actually visualize that for you, and then bring the metrics on top of that to show the application performance. But for us, it actually goes a bit beyond monitoring pure metrics: where is my data, where's it going to, who's using it? And certainly in the UK, you know, in Europe, we have the GDPR regulation; we now have the ability, and it's very important, to actually just say, you know, show me everywhere that you have credit card information, and who's using that and where's it going to. So that, I think, was a real eye opener, and it was a real eye opener for our customers, certainly Vortexa, when they were able to suddenly visualize not just the low level details but actually how the applications interact with each other, which is something that you could also get from a distributed tracing framework.
Tobias Macey
0:16:01
And one of the benefits of streaming, and Kafka in particular, and just the overall pub/sub paradigm, is that you can decouple the applications, and you don't necessarily need to care about what the downstream consumers are of the messages at the point of generation. But as you said, being able to gain visibility into all the ways that that information is being used is valuable for governance purposes and for debugging purposes, and to make sure that as you do change the format of the upstream data, you can alert the downstream teams so that they know that they might need to make some changes to their applications. And I'm wondering what are some of the inherent difficulties in being able to generate that overall topology of the application architecture and application environment around those streaming pipelines, and some of the ways that you're addressing that in Lenses?
Andrew Stevenson
0:16:55
Effectively, what you need to do is you need to guard application deployment, or have a way for applications to register themselves. After that, actually, it's not that difficult, as long as you know the inputs and outputs of the thing that you're running and have a way to collect the metrics. So we hook into the standard frameworks such as Prometheus to collect the metrics, and what we do is we provide a way that we can deploy the applications so we actually know the inputs and the outputs. So if you think about the SQL, right, we know what fields it was selecting from, joining on, and writing to; it's the same with the connectors as well. And we also have various different APIs and SDKs so that you can write your own custom code to effectively register itself with Lenses. So it's relatively easy to build that out, especially when you have Kafka, as long as we're able to track the inputs and outputs. And of course, then we can track back even to the Git commit for this application, so we can actually get a full lineage. I'm very big on compliance, because I came from a financial background as well: actually show me what the application was that was deployed, and what it was doing, and what did it touch.
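As a rough illustration of the "applications register themselves" idea described above, the sketch below has an application publish a descriptor of its inputs, outputs, and Git commit so that a topology or catalog service could stitch the landscape together. The registry topic and descriptor fields are hypothetical and are not the Lenses SDK contract; this assumes the confluent-kafka Python client and a broker on localhost:

    import json
    import os
    from confluent_kafka import Producer

    # A minimal sketch of self-registration: each app publishes a descriptor of
    # its inputs, outputs, and source revision so a topology/catalog service can
    # stitch the landscape together. Topic name and fields are hypothetical.
    descriptor = {
        "app": "payments-enricher",
        "inputs": ["payments", "customers"],
        "outputs": ["payments_enriched"],
        "git_commit": os.environ.get("GIT_COMMIT", "unknown"),  # lineage back to the build
    }

    producer = Producer({"bootstrap.servers": "localhost:9092"})
    producer.produce(
        "app-topology-registry",               # hypothetical registry topic
        key=descriptor["app"],
        value=json.dumps(descriptor),
    )
    producer.flush()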
Tobias Macey
0:18:05
Once you have that visibility of the application topology, what are some of the other interesting things that you can do given that information that was difficult or impractical prior to having that visibility?
Andrew Stevenson
0:18:18
So a good example was always in the banking industry: show me your data sets that have credit card information in them and who's touching it. We have that now. We also have a thing called data policies as well; we can apply masking on top of it at the presentation layer. But that's a good example of showing, you know, where is data being used, especially from a data perspective. Where is a credit card number? I want to know not only from a compliance point of view, but maybe I'm a data scientist and I'm interested in using it: where is it, where's it going to, what applications are already processing it, maybe I can leverage that and piggyback on that data as well. So we help create a catalog, and we help create a shareable data experience, all bound with the ethics and the governance. But also there are other aspects we've found as well, you know, even in our own internal systems: I'm reading from a database, what's the impact if this database goes down, who's going to be impacted, the downstream consumers as well. So once you have this rich data catalog, in a way, you open up a lot of possibilities of the questions you can ask it, either from a discovery perspective, to share data, or even from a compliance perspective as well.
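A minimal sketch of what presentation-layer masking can look like in practice, independent of any particular platform: the sensitive fields are redacted when a record is displayed, while the stored data stays untouched. The field names and the policy are invented for the example:

    import re

    # Illustrative presentation-layer masking: sensitive fields are redacted when
    # records are shown to a user, while the stored data stays untouched.
    MASKED_FIELDS = {"credit_card_number", "cvv"}

    def mask_value(value: str) -> str:
        """Keep the last four digits, hide the rest."""
        digits = re.sub(r"\D", "", value)
        return "*" * max(len(digits) - 4, 0) + digits[-4:]

    def apply_policy(record: dict) -> dict:
        """Return a copy of the record that is safe to display."""
        return {
            k: mask_value(str(v)) if k in MASKED_FIELDS else v
            for k, v in record.items()
        }

    print(apply_policy({"order_id": 42, "credit_card_number": "4111 1111 1111 1111"}))
    # {'order_id': 42, 'credit_card_number': '************1111'}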
Tobias Macey
0:19:30
And then another complicated aspect of working with streaming data, and large volumes of data in general, is that of the development cycle, in terms of ensuring that what you're writing now is going to, one, work in the production environment, and two, scale effectively, because of the fact that it's difficult to replicate the production environment onto a developer's laptop or into a QA environment for being able to do some vetting of that. And I'm wondering what you have found to be some of the useful strategies for creating that development workflow so that it's useful and effective, without being cumbersome for people to be able to get their local environments set up to be able to work together, and ensure that also the message structures that they're working with locally are going to match what they are going to be consuming once they get deployed.
Andrew Stevenson
0:20:21
Yeah, so what we have is actually an all-in-one Docker image, so that's a great test harness for developers to work with. And also the nice thing about certainly the Kafka Connect framework and the SQL processors is that it's all configuration. And if you couple that with the schema registry as well and have Avro in there, then you've got that contract, you've got that API contract for the data as well. So promoting between environments is relatively easy. It's config, a YAML file that we can put in GitHub; we can do a pull request onto the master branch, and Lenses can actually apply the desired state onto the Kafka cluster in whatever environment that may be. We see people use this to go from on prem into the cloud, or to any cluster; it doesn't really matter whether it's on prem to the cloud or from one provider to another. So that's where the DataOps aspect also fits into the DevOps aspect: I can define my entire landscape as configuration and have that applied by Lenses using a GitOps approach into another environment. And that fits very well with a standard developer's workflow. It fits into the CI/CD path that we have; we're even looking to actually wrap the Lenses CLI that we have into a Kubernetes operator, so we can hook into the Kubernetes toolchain as well.
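A simplified sketch of the "desired state in Git, applied to the cluster" idea, using only the plain Kafka admin API: a YAML document describes the topics that should exist, and anything missing is created. The YAML layout is invented for the example and is not the Lenses or lenses-cli resource format; it assumes the confluent-kafka and PyYAML packages and a broker on localhost:

    import yaml
    from confluent_kafka.admin import AdminClient, NewTopic

    # Desired state as it might live in a Git repository. The layout here is
    # invented for the example, not a Lenses resource format.
    DESIRED_STATE = """
    topics:
      - name: orders
        partitions: 6
        replication: 3
      - name: payments_enriched
        partitions: 3
        replication: 3
    """

    admin = AdminClient({"bootstrap.servers": "localhost:9092"})
    existing = set(admin.list_topics(timeout=10).topics)

    desired = yaml.safe_load(DESIRED_STATE)["topics"]
    missing = [
        NewTopic(t["name"], num_partitions=t["partitions"], replication_factor=t["replication"])
        for t in desired
        if t["name"] not in existing
    ]

    if missing:
        # create_topics returns a dict of topic -> future; wait for each result.
        for topic, future in admin.create_topics(missing).items():
            future.result()
            print(f"created {topic}")
    else:
        print("cluster already matches desired state")

In a CI/CD pipeline this kind of apply step would run after the pull request is merged, so the cluster is only ever changed through reviewed, version-controlled configuration.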
Tobias Macey
0:21:38
And another interesting piece from what you're saying there, too, is the idea of CI and CD for these streaming applications, and also going further into some of the DevOps paradigms. I'm wondering what your thoughts are on the viability of doing things like canary deployments, to ensure that as you're deploying something you can just do some sampling to see, is this producing the outputs that are expected, and doing some, you know, maybe feature flagging to ensure that as you roll things out they're not going to break things downstream, you can do some validation, maybe A/B testing of it, and just some of the complexities that exist in that CI/CD and validation phase of building these types of applications.
Andrew Stevenson
0:22:17
So I think with GitOps, the way we see it, and with everything being an API in Lenses, we allow people to build out whatever strategy they have there. Again, you know, actually being able to deploy something and sample the data, Lenses also gives you that, and you can even do that in an automated fashion, because everything, even the SQL browsing that we have, is via an API. So deploying, checking, and switching, you know, to be honest I haven't really thought too much about that, but we enable that by just actually fitting into standard CI/CD development practices. What I actually like, and I was doing this not too long ago, is this type of workflow: even if I'm, let's say, a Python developer, I could sit in Jupyter, and I can query something inside of Kafka. And I can actually then go, okay, I'm happy with that, let's deploy a SQL processor to do something. I'm happy with the output of that, because I can have that stream back to my Jupyter notebook, and then maybe deploy a connector to take that off to my environment. And again, I can version control all that config as code, so I can construct my CI/CD pipeline however I want to do that.
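A minimal sketch of the notebook-style workflow described above: peek at a handful of records from a topic to check the data looks right before deciding to deploy a processor. The topic name and the assumption that values are JSON are made up for the example; it uses the confluent-kafka Python client:

    import json
    from confluent_kafka import Consumer

    # Peek at a few records from a topic before deciding what to deploy.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "notebook-peek",
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,   # just looking, don't move any offsets
    })
    consumer.subscribe(["payments"])

    samples = []
    while len(samples) < 5:
        msg = consumer.poll(timeout=5.0)
        if msg is None:
            break                       # nothing more to read right now
        if msg.error():
            continue
        samples.append(json.loads(msg.value()))

    consumer.close()
    for record in samples:
        print(record)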
Tobias Macey
0:23:30
And then in terms of the Lenses platform itself, can you dig a bit more into how it's designed and implemented, and some of the evolution that it's gone through since you first began working on it?
Andrew Stevenson
0:23:41
Yeah, so actually, if you look at it, Lenses is actually just a JVM app. It's written in Scala, and underneath the hood it's just the standard, regular Kafka clients for the consumers and the producers. Obviously, you know, we have some secret sauce in there of how we build out and run the SQL engine, but that's all it is. So it's actually quite a small, lightweight application, and there's a small embedded backing data store we have there with it as well. So it just needs connectivity to the cluster. In terms of the evolution, it started off being very Kafka centric; we're trying to pull it away from Kafka, because we maybe want to swap in Pulsar or Redis or any other streaming technology in the future. The biggest evolution that's going on now, actually, is the SQL engine we have; we're about to release a new version of the SQL engine that really opens up what we can do with it, to allow us to plug in other systems if we want to, to bring in Python support for it as well. So pushing beyond Kafka, I think, is where we're moving to.
Tobias Macey
0:24:49
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS based monitoring and analytics platform for cloud scale infrastructure, applications, logs, and more. Datadog uses machine learning based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial, and if you start a trial and install Datadog's agent, they'll send you a free t-shirt. And in terms of being able to support those different engines, I watched a presentation that you gave recently at the Redis conference about being able to run SQL across Redis Streams, and I know that you also are working on Pulsar support. And I'm curious what are some of the ways that you're approaching the modularity of your system to support those different backend engines, and some of the complexities that arise as far as trying to establish a lowest common denominator or feature parity, while still having some capability of being able to take advantage of the specifics of the underlying engines?
Andrew Stevenson
0:26:01
So from the Lenses feature perspective, the hardest part there actually is naming, naming conventions. For example, as we push out and build the data catalog, we're moving away from the notion of topics, because we also now can query Elasticsearch, so finding terminology to model the data sets inside of Lenses is a challenge. But in terms of the actual way that Lenses has been implemented, I mean, at the core of it is the SQL engine, and the SQL engine itself is pluggable, certainly for the browsing aspect of it. So it's very easy for us to plug in another engine because, in effect, we call them drivers internally, so it's not that hard for us to do with the way the guys have architected it. Now, I'll be honest, sometimes that code is way beyond my remit, but it's built on a modular level, so we can plug in different systems that we want to inside of Lenses. So, for example, rolling out the SQL browsing not only for Elastic, which we have now: we have Pulsar support, and that's not turned on; it's very similar to what you saw in the Redis Streams as well. We're pushing that out to every system that potentially we could connect to. The SQL engine, we were making that modular so we could run it on top of Pulsar. However, there seems to be a consolidation, I would say, towards the Kafka APIs; for example, StreamNative now have a Kafka to Pulsar bridge, so actually they're helping us as well, because we don't have to implement that on the streaming side ourselves. Even such things as Azure Event Hubs, right, it's got a Kafka API; Lenses actually does work against Azure Event Hubs.
Tobias Macey
0:27:42
Yeah, I was going to ask about whether you were using that compatibility layer that they recently introduced, and also what are some of the other aspects of the ecosystem that you're hooking into for being able to process on top of that. But yeah, I was pretty excited when I saw them introduce that protocol compatibility layer in Pulsar, and I'm excited to see where they go with that piece of things.
Andrew Stevenson
0:28:05
Yeah, I mean, they're very eager for us to work on it, right, because I think Pulsar is actually very good technology. The problem is the ecosystem around it. So we helped build out the ecosystem around Kafka, and as soon as Pulsar gets that, then I think it's a natural fit. Even the client I was talking to today, they're not looking for vendor lock-in or technology lock-in towards Kafka, so they're also excited about the possibility of Pulsar. Now, we already have part of it, right, because the SQL engine is split into two parts: there's the more table based engine, which allows the debugging and the querying like in a relational database format, and that's very, very pluggable. We were going to have to do a bit of work to extend the SQL engine to run over Pulsar, but with the bridge, we're going to evaluate that now to see if we don't need to do that work. There's even another one, Vectorized.io; they've got another Kafka API compatible replacement out which they claim is ten times as fast. So, you know, we're also looking to see if we can do that. It's the same, I think, with Kubernetes: you end up coming back to a battle of the APIs, so APIs become dominant, which I think also helps.
Tobias Macey
0:29:14
Yeah, it's definitely an interesting trend that I've seen in a few different places, of different technology implementations adopting the API of the dominant player in the space. One of the notable ones being S3, where all the different object stores are adding a compatibility layer; in the Python ecosystem a lot of different projects are adopting the NumPy API while swapping out some of the specifics of the processing layer. And then in the streaming space, they're working on coalescing around Kafka, but they're also working on the open streaming specification to try and consolidate the specifics of how you work with those systems, so that they can innovate on the specifics of how that system actually functions under the hood.
Andrew Stevenson
0:29:55
Yeah. So the ecosystem is very, very important; you know, this is where we're seeing some of the problems. So another example, it's not necessarily related to Lenses, but I still have friends who work in the high frequency trading world. They're adopting Pulsar for the modeling capabilities of it, modeling option derivatives; there's lots of them, millions and millions of them, and Pulsar gives them that ability. But they're using it in their trading silo; they didn't want to open it up to the rest of the company for the day to day integration parts of it, because the ecosystem's not around it as well. So they actually use Kafka for that, and they're looking at Lenses to put on there. But for the more bleeding edge work that they're doing, Pulsar is a great fit, and they're also very excited about the bridge as well, because they can bridge that gap.
Tobias Macey
0:30:42
So beyond the specifics of some of the monitoring capabilities that you're building, and the SQL layer to bring everybody on board with a common language for being able to work with data, what are some of the other aspects of operationalizing data, and bringing more people into working on it together and collaborating, that you're looking at, either within Lenses or that you think, just generally within the industry, we should be adopting to ensure that it's easier to be able to build value more rapidly from the data that's being collected and consumed?
Andrew Stevenson
0:31:14
So one of the things we're looking at is the catalog. Again, the SQL engine forms a big part of that. It's the discovery of data, and how fast can I share my data experience. And I'm actually quite a big fan of the data mesh architecture, having data delivered as a product end to end. So what we're looking at is how we can use the SQL engine not only for the processing and the visibility, but also to build up that rich data catalog, so we can figure out where data is and share it. But one of the big things that we're keen on is you've got to do that with some form of compliance. So we actually put a multi-tenancy system on top of Kafka; it doesn't have it, okay, Pulsar does; so we can actually safely onboard people onto these systems, so they can use the data catalog to actually drive back to building a data product. So I think that's also what I'm seeing. There's a big project from IBM called Egeria, around the data catalog, so we can actually make use of all this data that's out there. For me, it always comes back to this tech intensity: use the commodity hardware, the commodity software that's been built, the commodity infrastructure, to go back to actually making use of the data. What am I doing with this data? Why am I here? You know, I'm a little bit controversial sometimes, because I say you're not a technology company, you're a company that's delivering a service, and technology is enabling it, actually.
Tobias Macey
0:32:36
And so in terms of the Lenses platform itself, what are some of the most interesting or unexpected or innovative ways that it's being used, or insights that you've seen people gain from it as they have adopted it for their streaming architectures?
Andrew Stevenson
0:32:52
Well, what we actually see is that people start off with Kafka and Lenses first to just get visibility on what's happening with Kafka, and then as they mature, they move into different aspects, using the SQL engine as well to feed different machine learning models. I was speaking to one client, and what they're looking at doing is actually optimizing how they're firing lasers at tin droplets for microchip processing, and they want to feed that back in a real time loop so they can optimize the shape of these tin droplets. So I think there's a wide range of industries we're in. We have everything from standard ETL work to more machine learning driven; we have IoT, especially in Canada, tracking cows, you know. So there's a range of technologies and use cases. But, you know, I always go back to Babylon Health, because I think they're using Lenses to help build that tech intensity, to feed that AI model, so that we can build this healthcare for everyone across the world. I think that's always a great use case that they're doing. You know, the ability to actually run these SQL engines and to feed data into their AI models is pretty cool, with a great outcome, especially in the current climate.
Tobias Macey
0:34:10
And one other aspect of the streaming environment is the, at least perceived, dichotomy between streaming processes and batch processes, where a lot of the ETL workflows are operating in more of a batch mode, and then a lot of people are moving to using things like Kafka and Pulsar for streaming ETL. I'm wondering what your thoughts are on some of the benefits of that dichotomy of those different approaches, and maybe some of the ways that they can and should merge together and just treat everything as a continuous stream?
Andrew Stevenson
0:34:44
Well, I think everything is a stream, right? You know, I think that's been well publicized, and batch is just a subset. The reality is, though, especially as you move into large organizations, everything is still based around a batch system; the upstream systems are batch, right? So it's very hard to completely move from one to the other. For example, when I was at a Dutch energy company, they couldn't get past asking, where is my FTP, where's my CSV file, and no matter how much we pushed them towards a streaming solution, you know, having a Kafka connector stood up to pull in those files from an FTP server and stream them into Kafka, the upstream system is still sending in batch. And until you actually get the entire landscape moving to a streaming world, you're still going to have these two versions of doing it. It's especially prevalent in the financial industry, where they try to do risk calculations and they need the whole set of data at the end of the day. So there's still this infrastructure layer around it that makes it very hard for some companies that are not streaming native to actually take the plunge and completely move away. I personally see no problem with having the two working together, right? You know, if you need to have a batch driven world, and you push everything into Kafka and have a trigger or event on the back of that that does something else, that's fine. If you look at a data mesh architecture, you know, push it to something that is suitable for your use case; Kafka is only one part of the solution, it's part of a bigger platform. So I have no problem with them working side by side; it's just very, very difficult to change. If you've got a mainframe and it's spitting out a file, you're not going to get the mainframe changed. So you've got this batch file, and yes, you can feed it into a streaming system, but a lot of the architecture and a lot of the business is driven around batch processing, end of day reporting especially. You know, in the trading world, what's the end of day risk, what are the end of day positions as well; that's always the bottom line. They do have intraday trading things, but it's always end of day, that's the one that's legally binding. So until the business actually shifts as well, it's going to be hard to completely move to a streaming architecture.
Tobias Macey
0:37:01
And then another note on the idea of streaming ETL is the thought of the workflow orchestration, where we have tools such as Dagster and Airflow that are built around this batch model of take this thing, process it, do the next step. And then there's the concept of, as we were discussing earlier, the topology of applications that are interacting with your streaming system, and wondering what you have seen to be some of the crossover of the workflow management for data processing, and how that can coexist or function natively with stream processing.
Andrew Stevenson
0:37:38
So what I've generally seen is that the batch processing, whether it's in DAGs or anything else, can be triggered off the events, and, you know, they're coming from Kafka. It may be a SQL processor that's running in real time that spits out a result that triggers a job to do something. That's the general pattern we see, and that's why they can coexist. And if you need to have a batch process that's triggered off an event, okay, well, the streaming architecture can help you do that, and Kafka can help you do that, but you're reacting to an event to trigger the process. This is quite common, you know: I could write stuff into S3, and there's a Lambda on the other side that maybe triggers an email when a new file rolls in as well. So they live side by side. If I'm writing to S3, I've got to batch into a file anyway for performance. So it's never clear cut that you are all into streaming and everything has to be streaming. I think there's always a place for both.
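As a sketch of the event-triggered pattern described above, here is what an AWS Lambda handler reacting to a new S3 object might look like, publishing a notification that a downstream batch job or an email alert can hang off. The SNS topic ARN is a placeholder, and the wiring of the S3 trigger itself is assumed to be configured separately:

    import boto3

    # Sketch of a Lambda function that fires when a new object lands in S3 and
    # triggers a downstream action (here, an SNS notification). The topic ARN
    # is a placeholder; the event fields follow the standard S3 event shape.
    sns = boto3.client("sns")
    TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:new-batch-file"  # placeholder

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # Kick off the batch job / alert now that the file has rolled in.
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject="New batch file arrived",
                Message=f"s3://{bucket}/{key} is ready for processing",
            )
        return {"processed": len(event["Records"])}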
Tobias Macey
0:38:39
In your own work on the Lenses platform and using it within your own systems, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned in the process?
Andrew Stevenson
0:38:51
I would actually say discovering data. For example, we collect telemetry ourselves, of Lenses as well, for our own systems, and it was a bit of an eye opener when we were trying to figure out how much data we were collecting, to actually pull it into Lenses so we could use Lenses internally. So it really opened my eyes to the challenges of being able to discover and manage data at scale. I'm not saying that, you know, we're anywhere near an investment banking scale, but it's interesting how quickly you accumulate different silos of data, and being able to discover them quickly as well. So these are all things that we're feeding back into the product, based on our own experience. And then for people who are interested in gaining better visibility into their systems, or having a DataOps flow around streaming, what are some of the cases where Lenses might be the wrong choice, and they might be better suited by looking to other tools or building them in house? So I think Lenses can generally help; right now, I think the question is, do you need Kafka, do you need a distributed streaming platform or not, I think is more what I would say there. Now, these come with a cost; it doesn't matter whether it's Pulsar or Kafka, there is a cost to running these applications, and then you have to have all the monitoring around it, and the tooling around it, which is why we built Lenses. So I think it's more of a question of, do you really need Kafka, do you really need the scale? You know, we've had potential clients that maybe had one message every hour; okay, probably you don't need Kafka, right? Maybe you'd be better off with a traditional JMS message broker or something along those lines. So, if you do need Kafka, if you do need the scale of Kafka, then we help, but I'm a bit of a pragmatist about it, you know, using the right technology. If you don't need a big data solution, don't pick a big data solution, and a lot of people don't. I saw this at one company, and it wasn't necessarily around Kafka; they were adamant, and maybe for a bit of resume plus plus, they were like, no, no, we need a Spark cluster, we've got lots of data. They didn't have lots of data, and they were already familiar with SQL Server; they would have been far better off actually just putting SQL Server in there. So do you need Kafka, do you need distributed systems? I think that's ultimately the choice there.
Tobias Macey
0:41:19
And as you look to the future of the Lenses platform, and the streaming ecosystem in general, and the DataOps principles that you're trying to operate around, what are some of the things that you have planned for the future of both the technical and business aspects of what you're working on?
Andrew Stevenson
0:41:35
So we are looking at the cloud; you know, we believe in managed solutions, because it's all about focusing on the data and the commodity technology that's there. I can go to the cloud and get Kafka and Kubernetes now, with all the databases and the key vaults around it that I need, so we're looking at how we further integrate with the cloud as well. We're also continuing to build out the data catalog aspect of it. And on the SQL side, you know, we're continuing to push out what systems we can query, and also looking at what we call SQL lambdas, although I'm not convinced that's still the correct term: pushing more into the SQL engine so that we can connect and write to other systems. For example, Kafka Connect is great, but do I want to have to stand up a Kafka Connect cluster when I don't need one? Maybe my SQL engine can actually directly write out to Elasticsearch or anything else. So that's where we're looking, along with a lot more we think we can still do around the governance side of it, approval flows, for example, making sure that we get data stewardship in there, and the governance aspect. So that's our general focus of where we're going.
Tobias Macey
0:42:50
Are there any other aspects of the Lenses platform itself, or DataOps principles in general, or some of the ways that people are using Lenses, or how it fits into the overall workflow of building streaming applications, that we didn't discuss that you'd like to cover before we close out the show?
Andrew Stevenson
0:43:06
No, I think the most important thing for us is, you know, having that tooling and the visibility around it; that's where I've seen the success. And I think we have to really start thinking about what we want to do with the data and using the technology to get there. And trying out Lenses is pretty simple: we have a free box, you can go to lenses.io and download the box, and we have a hosted online portal as well if you want.
Tobias Macey
0:43:30
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Andrew Stevenson
0:43:44
Visibility, it's always been visibility, I would say, for me. I went from using SQL Server, and I was moving more data than I ever did, actually, with big data technologies, and having that visibility is the killer. You know, I was at an investment bank, and you try and give the head of trading at an investment bank a command line tool and say, hey, go look at your data; it doesn't work. So if you really want to make a success of your platform, you have to provide that visibility, not just into the infrastructure, but into the applications and into the data, and get that in the hands of the people who understand it.
Tobias Macey
0:44:20
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing on Lenses, and some of the ways that you're helping to empower people who are building streaming infrastructure to gain some of that visibility into the applications that they're building and the data that they're working with. It's definitely a very challenging and necessary area of work, so I appreciate all the effort you put into that, and I hope you enjoy the rest of your day.
Andrew Stevenson
0:44:42
Yeah. Okay. Thank you very much for having me.
Tobias Macey
0:44:49
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

Liked it? Take a second to support the Data Engineering Podcast on Patreon!