Enterprise Data Operations And Orchestration At Infoworks - Episode 131

Summary

Data management is hard at any scale, but working in the context of an enterprise organization adds even greater complexity. Infoworks is a platform built to provide a unified set of tooling for managing the full lifecycle of data in large businesses. By reducing the barrier to entry with a graphical interface for defining data transformations and analysis, it makes it easier to bring the domain experts into the process. In this interview, Infoworks co-founder and CTO Amar Arsikere explains the unique challenges faced by enterprise organizations, how the platform is architected to provide the needed flexibility and scale, and how a unified platform for data improves the outcomes of the organizations using it.


Data pipelines are the backbone of modern data systems and yet existing solutions require excessive coding and don’t operationally scale. The Ascend Unified Data Engineering Platform makes it possible for teams to quickly and easily build autonomous data pipelines that dynamically adapt to changes in data, code, and environment. Data engineers using Ascend can now ingest, build, integrate, run, and govern advanced data pipelines with 95% less code. Start building on Ascend today with a free 30-day trial and partner with a dedicated data engineer to help you get started.


Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Free yourself from maintaining brittle data pipelines that require excessive coding and don’t operationally scale. With the Ascend Unified Data Engineering Platform, you and your team can easily build autonomous data pipelines that dynamically adapt to changes in data, code, and environment — enabling 10x faster build velocity and automated maintenance. On Ascend, data engineers can ingest, build, integrate, run, and govern advanced data pipelines with 95% less code. Go to dataengineeringpodcast.com/ascend to start building with a free 30-day trial. You’ll partner with a dedicated data engineer at Ascend to help you get started and accelerate your journey from prototype to production.
  • Your host is Tobias Macey and today I’m interviewing Amar Arsikere about the Infoworks platform for enterprise data operations and orchestration

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what you have built at Infoworks and the story of how it got started?
  • What are the fundamental challenges that often plague organizations dealing with "big data"?
    • How do those challenges change or compound in the context of an enterprise organization?
    • What are some of the unique needs that enterprise organizations have of their data?
  • What are the design or technical limitations of existing big data technologies that contribute to the overall difficulty of using or integrating them effectively?
  • What are some of the tools or platforms that InfoWorks replaces in the overall data lifecycle?
    • How do you identify and prioritize the integrations that you build?
  • How is Infoworks itself architected and how has it evolved since you first built it?
  • Discoverability and reuse of data is one of the biggest challenges facing organizations of all sizes. How do you address that in your platform?
  • What are the roles that use InfoWorks in their day-to-day?
    • What does the workflow look like for each of those roles?
  • Can you talk through the overall lifecycle of a unit of data in InfoWorks and the different subsystems that it interacts with at each stage?
  • What are some of the design challenges that you face in building a UI oriented workflow while providing the necessary level of control for these systems?
    • How do you handle versioning of pipelines and validation of new iterations prior to production release?
    • What are the cases where the no code, graphical paradigm for data orchestration breaks down?
  • What are some of the most challenging, interesting, or unexpected lessons that you have learned since starting Infoworks?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Tobias Macey
0:00:13
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Free yourself from maintaining brittle data pipelines that require excessive coding and don't operationally scale. With the Ascend Unified Data Engineering Platform, you and your team can easily build autonomous data pipelines that dynamically adapt to changes in data, code, and environment, enabling 10 times faster build velocity and automated maintenance. On Ascend, data engineers can ingest, build, integrate, run, and govern advanced data pipelines with 95% less code. Go to dataengineeringpodcast.com/ascend to start building with a free 30-day trial. You will partner with a dedicated data engineer at Ascend to help you get started and accelerate your journey from prototype to production. Your host is Tobias Macey, and today I'm interviewing Amar Arsikere about the Infoworks platform for enterprise data operations and orchestration. So, Amar, can you start by introducing yourself?
Amar Arsikere
0:01:42
Yeah, absolutely. Thank you for having me, Tobias. My name is Amar Arsikere. I'm the founder and chief technology officer at Infoworks. My background is building large scale data systems. I started my career as a software engineer. I built large data systems at Google; I built the first data warehouse on Bigtable at Google, and an analytics platform that ran, you know, all of the internal analytics there. And that's really how I got started in data management. And, you know, there's a lot of inspiration from the work that I did there in starting Infoworks.
Tobias Macey
0:02:15
So what was your best resource for being able to learn about some of the best practices and the discovery involved in being able to build out that data warehouse on top of Bigtable, and some of the challenges that you faced in the process?
Amar Arsikere
0:02:28
So building a data warehouse and a petabyte scale system on top of Bigtable and big data technologies was pretty brand new. We were doing the world's first data warehouse on top of Bigtable, and the resources were really the original inventors of Bigtable and, you know, the big data technologies inside Google. In fact, a lot of the open source systems, you know, Hadoop and Spark, came out of the original paper that was published out of Google. And I had access to those resources inside Google, so to speak. That's how, you know, I was able to build it and run a pretty successful analytics platform that had thousands of use cases, thousands of users, and petabyte scale data sets.
Tobias Macey
0:03:12
And now you've built the Infoworks platform to be able to solve some of the challenges that you've experienced in the big data space and some of the problems that are inherent to large organizations. I'm wondering if you can give a bit more detail into what you've built at Infoworks and some of the story behind how it got started.
Amar Arsikere
0:03:31
Yeah, you know, as I said earlier, my first exposure to large scale data systems was at Google, where I built the world's first data warehouse on Bigtable. After Google, I joined a company called Zynga, you know, the social gaming company, and they had a unique challenge of large scale data sets, but also needed to make predictions and, you know, analytics on top of it very quickly. There I built an in-memory database supporting more than a hundred million players; it became the world's largest in-memory database at that time. And the unique thing that I did there was to build, like, thousands of analytics pipelines, you know, that used to feed everything from gaming dashboards to, you know, gamer behavior and recommendations and so on. So the lesson that I learned building these two systems was that there was a need to automate a lot of the data operations in, you know, lots of enterprises. The born-digital companies like Google and Zynga and Facebook and Amazon had built all these platforms for their internal consumption, and, you know, it made sense that this is something that's going to be required by any data driven organization. You know, that was sort of the inspiration to start Infoworks. And the entire product team that has built Infoworks essentially has the same background that I have, which is running large scale analytics pipelines. And, you know, when we talk about large scale, I'm talking about thousands of analytics pipelines being built and managed on this platform.
Tobias Macey
0:04:58
And what are some of the fundamental challenges that often plague organizations that are dealing with, quote unquote, big data, and how do those challenges change or compound in the context of an enterprise organization and the organizational complexities that manifest there?
Amar Arsikere
0:05:14
You know, this is a great question. If you look at the usage of data and how people are managing their data assets, you can, you know, really segment the world into two sets of companies. There are the born-digital companies like Google, Facebook, Amazon, Netflix, and Zynga and so on, where they have a foundational platform on which they are essentially always building out a 360 degree view of their business. On the other side, you have all these other companies who are in various stages of data maturity, where the approach is a use case by use case, you know, sort of build out. So they are gathering data for every use case, and as a result there is a fragmentation of the data assets within the company, and gathering a 360 degree view of the business becomes, you know, pretty challenging. That world, I mean, you know, is not built on a foundational platform so much. It's built on point tools, and there is a lot of glue code that dominates that world, and that becomes also very challenging. The data gets fragmented, teams get fragmented, the skill sets get fragmented. So these are some fundamental challenges that many companies are facing when they have to deal with, you know, data that represents the business. You know, you can call it big data, because the amount of data is large, the complexity is pretty large, and how do you manage all of this becomes very critical.
Tobias Macey
0:06:31
And in terms of the enterprise organizations, what are some of the unique needs that they have of their data that aren't necessarily going to manifest at either smaller scales or for newer companies that are maybe large, but don't necessarily have as much of that legacy infrastructure and legacy data that they need to be able to support?
Amar Arsikere
0:06:51
Yeah, so in enterprises, I mean, number one, now every enterprise is a data driven organization. So the need for data itself is accelerating; every department is becoming data driven, which means they need data to make decisions. And as a result, every organization, in order to become data driven, really has to build thousands of use cases where they are essentially using their data, and fresh data, which becomes also very important to making decisions. And today's state of the art is to, you know, have 10 or 20 use cases that will be built, at best. Because, you know, in today's architecture, you have to hire a team of engineers who are coding those pipelines, and then, you know, it takes a fair amount of time to operationalize it in production, and there is a limit to what you can do, versus automating a lot of the data operations, which is going to get you the agility that you need to run an organization and become much more of a data driven organization. So that's a unique, you know, set of challenges that enterprises face. They also have the legacy tooling and the point tools and the glue code that, you know, you mentioned earlier, which becomes, you know, a challenge to maintain when you're introducing new use cases.
Tobias Macey
0:08:12
And the big data technologies that we have now are generally fairly built for purpose, either by the original organization that used it and then open sourced it, or by the academic institution that was using it for a particular area of research, which can lead to some sharp edges or difficulties in integrating it into the larger ecosystem. But from your perspective, what have you found to be some of the design or technical limitations of those existing technologies? And how does that contribute to the overall difficulty of using or integrating them effectively into an enterprise organization?
Amar Arsikere
0:08:48
Yeah, I think the legacy integration technologies, right, I mean, they have a limitation. Number one, you know, it is still pretty coding heavy, and you still have a lot of coding that you have to do. I mean, there are many tools which are, you know, visual programming based, but it's still programming. Which means you're still programming and, you know, building out those pipelines and things like that. What happens when there are schema changes? What happens when you have to do time series analysis? None of those things are automated; you're building out all of this manually. So the automation part is very critical to achieve these thousands of use cases, to become a data driven organization. That's one limitation in the existing legacy technologies. Also, many of the legacy technologies or integration tools were built for a SQL-based database, and as a result they become, you know, SQL-centric in a distributed world. Whether you're using big data technologies like Hadoop or Spark, it's important for the tooling layer, the data engineering layer, to know about this distributed infrastructure. One example is, you know, the distributed technologies have this parallelism built in, but your data needs to be organized. And what I mean by the data needs to be organized to make use of that parallelism is that the data needs to be partitioned correctly, and SQL-centric tooling does not necessarily deal with that. So that's one of the technical challenges we solve with Infoworks, where, you know, if you have a partition of one, you get a parallelism of one; if you have a partition of 100, you get a parallelism of 100. But organizing the data into the right partitions is something that we do. You know, we support what is called hierarchical partitioning to get the benefit of the underlying, you know, technologies. And that becomes important as you are dealing with larger volumes of data, more use cases, and so on; you need to get the power of your distributed big data architecture into your tooling layer.
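As a rough illustration of the partition-to-parallelism point above, here is a minimal PySpark sketch. The paths and column names are hypothetical, and this is only the general technique, not how Infoworks implements it.

```python
# Minimal sketch: the degree of parallelism in Spark follows the number of
# partitions the data is organized into. The path and the columns "region"
# and "event_date" are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

df = spark.read.parquet("/data/raw/orders")  # hypothetical input path

# A single partition means a single task works through all the data serially.
single = df.coalesce(1)

# Repartitioning on a well-chosen key lets 100 tasks work in parallel.
parallel = df.repartition(100, "region")

# Hierarchical (multi-level) partitioning on write keeps related data together
# so downstream jobs can prune and parallelize along both levels.
(parallel.write
    .partitionBy("region", "event_date")
    .mode("overwrite")
    .parquet("/data/curated/orders"))
```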
Tobias Macey
0:10:48
So for an organization that might be looking to use Infoworks to solve some of their data challenges, what are some of the signs or symptoms that would lead them in your direction?
Amar Arsikere
0:10:58
So one of the traps that we see is that a number of companies have already built, you know, sort of a do-it-yourself platform for this, especially on the new technologies, like the, you know, Hadoop and Spark based systems. One of the challenges that they face is the continuing investment of engineering in maintaining those do-it-yourself platforms. And, you know, some of those do-it-yourself platforms are essentially built on point tools with a lot of glue code. So when maintenance becomes a challenge, many of the enterprises, you know, we have worked with have come to us, and we have successfully replaced that in-house, do-it-yourself platform with the Infoworks platform, which has, you know, given them the agility to run their organization. So that's one of the successful journeys for an enterprise.
Tobias Macey
0:11:51
And so for the organizations that are integrating Infoworks into their systems, what are some of the tools or technologies that they might be replacing, and what is the process of actually integrating the Infoworks platform into the data technologies that they already have running?
Amar Arsikere
0:12:07
So from a replacement standpoint, there are a number of legacy integration tools. It could be the ETL tools that, you know, you may be familiar with, like Informatica, Talend, or Pentaho; these are the legacy integration tools that we have, you know, replaced in many instances. It could be ingestion tools, like Sqoop or NiFi, or GoldenGate in some cases for doing CDC. It could be cloud-native tools, whether for ETL or ingestion or orchestration. So those are some of the tools that we have replaced. And just to first talk about what it is that we do: Infoworks is an enterprise data operations and orchestration platform. It spans all of the data operations functionality, like, you know, ingestion, CDC, merge, and building a time series of your datasets as data comes in, data transformation, data modeling, and then orchestrating all these, you know, pipelines in production. So we span the whole gamut of data operations. These are the technologies and the tools that we replace in an enterprise setting.
Tobias Macey
0:13:20
And then as far as the integrations that you build to make sure that you can work with all the systems that are pre-existing at these organizations, how do you identify and prioritize your work to make sure that you have sufficient support for being able to provide the functionality that is needed?
Amar Arsikere
0:13:38
Yeah. So there are, you know, three layers of integration that you can categorize in the world of data operations. On one side you have the data sources, and we support integrations into most of the enterprise data sources. We have 30-plus data sources, all the way from Oracle, Teradata, and SQL Server to file systems, streaming data sources, mainframe connectors, API based data sources, and so on. Typically that's sufficient in many cases, you know, to ingest data, onboard data sources, and so on. We also have a connector framework, so in case one of the data sources you have is not supported, you can easily add support for your new data source, customize it, and build it out. And those connectors can be, you know, added as an enhancement to the base platform, building on top of the existing platform. The other category of integration is, if you have your own custom code for data transformations and things like that, we support, you know, two kinds of integration. You can bring your own code and make it a node inside Infoworks, or we also support a loosely coupled integration where you can call into your system, pass parameters and all those things, using our orchestrator mechanism to manage those dependencies and parameter passing and, you know, fault tolerant execution of those pipelines, and then pass the control back into Infoworks for, you know, further processing. So that's much more of a loosely coupled integration. And the third one is connecting to, you know, different layers for data consumption, whether it's cloud endpoints, it could be, like, you know, BigQuery, or some other technologies that you want to deliver the data to. We support a number of, you know, those integrations as well. And the way we prioritize these integrations is really, you know, we are very tightly interconnected to our customers and the use cases that they are building; we are essentially supporting those initiatives and prioritizing based upon, you know, where their use cases are going.
Tobias Macey
0:15:40
And what are some of the key principles on which the Infoworks platform is built that have guided your development and improvements of the overall capabilities of the system?
Amar Arsikere
0:15:51
Yeah, absolutely. So Infoworks, I mean, we call it the three pillars; the Infoworks platform is built on, you know, these three pillars. The first one is deep automation. You know, that's our background; anything and everything that we can automate in data operations, we have built in automation for, which enables people to build out these thousands of use cases. The second one is infrastructure abstraction, because there are going to be different kinds of data, computation technologies, distributed execution engines, and so on. We have built this on an infrastructure abstraction, so you can deploy it in any sort of environment, whether it's on premise, in the cloud, or in a hybrid setting. And the third principle on which we have built the platform is to make all of the data operations available in a single place. So you can think of it as an integrated solution, whether it's for onboarding a data source, transforming your data source, or operationalizing it; all of this functionality is available in a single place, so it's easy for you to adapt to changes as things are evolving in your use case. And the fourth one, which is important, I'm just going to add, is that it's built natively for these big data based systems, the distributed architectures. And that is very important in the sense that, you know, these parallel engines provide a lot of power, and unless you provide a native integration into these parallel engines, you're not going to get the benefit of it. And that's what we have done with the Infoworks platform.
Tobias Macey
0:17:29
Can you talk a bit more about how the Infoworks platform itself is architected, and some of the ways that that design has evolved since you first began working on it?
Amar Arsikere
0:17:37
Yeah, absolutely. So, you know, our origins were in terms of, you know, automating the data operations, and that's where we came from. The way that the Infoworks system was architected was essentially that it provided automation for all of the data operations, and it encompasses, you know, the entire gamut of the data operations. So we start with crawling of a data source; we fetch everything about a data source, like, you know, the metadata, the relationships between tables, and things like that. Once the crawling of the metadata is completed, you have the option of ingesting the data, both, you know, from a historical load standpoint, where you need to load all of the data in the beginning, to doing CDC, right, the change data capture, where, you know, as data is changing in the source systems, you can bring the changed data into your analytics platform or your big data system. And we support something called change schema capture. So if there is a change in schema, it automatically adapts to those changes; it does all the things that you need to do for supporting the new schema. One of the unique things about a big data system is that big data systems are typically immutable, which means if there are records updated in the source system, there is no easy way for you to change that in your big data system where you're building your analytics. So we support a continuous merge process. The continuous merge process, you know, automatically syncs the data and gives you an updated view into the data, right. And that's very important in this new architecture, you know, where you're dealing with immutable systems. And as the data is coming in, we are, you know, auditing it on a time series. So the data is organized on a time series with current and history tables. This is, you know, something that lets you do a lot of time series analysis very quickly. And that, you know, essentially builds your data catalog, which is continuously refreshed and maintained, and it provides the basis for the rest of the processing. And we have this concept of data domains, where the data catalog can then be provisioned into these data domains, where data analysts and data scientists can come in and perform transformation operations and data modeling operations on top of that data catalog, and they are seeing a slice of the data catalog. You know, that way you can have control over the data access and who sees what. The data transformation pipeline is also heavily automated when you compare Infoworks against the visual programming approaches. If you have to deal with things like an incremental pipeline, you know, it's a single click operation in Infoworks, versus in a legacy ETL tool you have to deal with, you know, things like auditing the tables on a time axis, capturing, like, a watermark so that you can do incremental processing later, and so on; all these things have to be manually done. So those are the kinds of automation that are built in. Slowly changing dimensions are a click of a button; you know, click a button and then it organizes the data in the data model into current and history tables and so on. It also captures lineage, and versioning is all performed on those data transformation pipelines. And once the data transformation, you know, logic is built out, the final result is your data model, and the data model you can also accelerate into an in-memory data model.
So that way, you're not only getting your raw data, but you're shaping the data into the right format in a data model, and also creating the data model at the right speed. You can get, you know, a data model in our in-memory system that is going to be eight to ten times faster than on your, you know, base system. So those kinds of accelerated data models you can do, you know, right in the platform itself, and it also provides access to integrate into your ML and AI algorithms as well, you know, and supports things like training of your models. And once these things are built out, the data models are built out nicely and represent your business, you can export those data models into other consumption endpoints. It could be, you know, BigQuery in the cloud, or Snowflake, or some of the technologies that you're using to consume those data models, while you're continuously refreshing and maintaining them and delivering those data models for consumption. Or it could be tools like Tableau or Qlik and so on. And the last stage of this is to orchestrate these pipelines so that you are running them every 15 minutes or every hour or every day, and, you know, managing dependencies, parameter passing, making it fault tolerant with, you know, retry and restartability, and managing it on a day to day basis. So we have an orchestrator that can be used to orchestrate these pipelines and deliver the end result. And all of this is built in a single platform, so that way you can have data engineers, data scientists, data analysts, and production support or production admins all work on a single collaborative system.
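To make the incremental-pipeline idea mentioned above more concrete, here is a minimal sketch of watermark-based incremental processing in PySpark. The table names, column names, and state file are hypothetical examples of the general pattern, not Infoworks' internal implementation.

```python
# Illustrative sketch of watermark-based incremental processing: each run pulls
# only rows newer than the last recorded high-water mark, then advances the
# watermark. Table, column, and path names are hypothetical.
import json
from pathlib import Path
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-sketch").getOrCreate()
STATE_FILE = Path("/state/orders_watermark.json")  # hypothetical location

def load_watermark() -> str:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_modified"]
    return "1970-01-01 00:00:00"  # first run: full historical load

def save_watermark(value: str) -> None:
    STATE_FILE.write_text(json.dumps({"last_modified": value}))

previous = load_watermark()

# Pull only the rows that changed since the previous run.
increment = (spark.read.table("source.orders")
             .where(F.col("last_modified") > F.lit(previous)))

increment.write.mode("append").saveAsTable("staging.orders_increment")

# Advance the watermark to the newest timestamp we just processed.
newest = increment.agg(F.max("last_modified")).first()[0]
if newest is not None:
    save_watermark(str(newest))
```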
Tobias Macey
0:22:44
There are a few things that I want to pull out from there. One of those is the concept of the change schema capture that you mentioned, and being able to propagate those schema changes throughout the different systems, which I know is one of the canonical problems that data engineers are faced with: how the evolution of data in the source systems can be reflected downstream. And I'm wondering what you have found to be some of the useful strategies on that point, and some of the challenges that you've faced in terms of being able to reflect those schema changes in a way that is non-destructive at the destination systems?
Amar Arsikere
0:23:18
Yeah, absolutely. This is a great question. So one of the things that we did, and I think that's a strategy we have followed across our system, is we provide an automated change schema capture, but at the same time, you know, we provide it in a way that you can be notified, and, you know, there can be human involvement and authorization before those changes are automatically applied. That is a strategy we have applied across the entire system, especially for the schema change capture. So that way, you know, you can automate certain kinds of changes into the system, and then, you know, for other kinds of schema changes you can also make it manual, in the sense that you have to get notified, and it needs to be authorized or approved before those changes are percolated into the system. That strategy has helped, because there is no one-size-fits-all automation story, especially when it comes to schema changes. And there are, in many cases, schema changes that should not really have been performed in the source systems, and when that shows up in your data warehouse systems, it in many cases requires, you know, human intervention. So that's one of the things that we have learned that has worked pretty well.
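A minimal sketch of the approve-before-apply pattern described above, assuming a hypothetical notification hook and table names; the functions here are illustrations of the general idea, not Infoworks' actual API.

```python
# Sketch of gated schema-change handling: additive changes are applied
# automatically, anything destructive or ambiguous is queued for human
# approval. All names (tables, notify hook) are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-drift-sketch").getOrCreate()

def diff_schemas(source_cols: dict, target_cols: dict) -> dict:
    """Compare {column: type} maps and classify the drift."""
    added = {c: t for c, t in source_cols.items() if c not in target_cols}
    removed = [c for c in target_cols if c not in source_cols]
    retyped = {c: (target_cols[c], t) for c, t in source_cols.items()
               if c in target_cols and target_cols[c] != t}
    return {"added": added, "removed": removed, "retyped": retyped}

def apply_or_queue(table: str, drift: dict, notify) -> None:
    if drift["removed"] or drift["retyped"]:
        # Destructive or ambiguous change: notify and wait for approval.
        notify(f"Schema change on {table} needs approval: {drift}")
        return
    for column, col_type in drift["added"].items():
        # Purely additive change: safe to apply automatically.
        spark.sql(f"ALTER TABLE {table} ADD COLUMNS ({column} {col_type})")
```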
Tobias Macey
0:24:33
And then another element is the data cataloguing. I know that discoverability and reuse of data assets in general is one of the biggest challenges that face organizations of all sizes, and I'm sure that that is compounded with the different business units that exist within an enterprise organization. I'm wondering how you address that discoverability in your platform to make sure that you can cut down on rework and duplicated effort between different silos.
Amar Arsikere
0:25:03
Yeah, you know, the data catalog is something that is central to the organization. I mean, it truly represents, you know, the data assets that an enterprise needs to manage, and, you know, it has to have a lot more metadata information for, you know, the managers of the data. So one of the things that we have done is, I mean, you know, the data catalog is searchable and taggable; that's, like, you know, something pretty much most systems have. We also have a mechanism to enhance the data catalog with, you know, both a technical glossary as well as a business glossary. So you can upload, you know, Excel files that represent your business metadata, and then the system tags the business metadata onto those columns and tables and so on. And again, everything becomes searchable as you're uploading and enhancing the system. So that's one part of it. The other thing that we have done is what we call a data engagement dashboard. So as you're using this data day in and day out, whether by running the pipelines, building data models, orchestrating and running it in production, and so on, we are also capturing and showing what are the most critical, important data assets within your organization. We are calling it a data engagement dashboard, so you can see what are the top five data models, what are the top five raw data sources, and who are the heaviest users of your data, and so on. So you get a view of what are the critical data assets in an organization. And, you know, data assets are just like any other, you know, sort of physical assets that a company may have, whether, like, you know, it could be stores or machinery or equipment and so on; there has to be a management layer. And that's how we are viewing it, and we are putting a lot of, you know, emphasis on that data engagement dashboard for the managers of data.
Tobias Macey
0:26:54
For the different roles that are interacting with Infoworks, I'm wondering if you can give a high level view of the different responsibilities within an organization that might be using Infoworks in their day to day, and the different workflows that they might engage in with your platform.
Amar Arsikere
0:27:14
Yeah, so our platform supports role based access control. Typically what we have seen is, you know, data engineers are mostly interested in onboarding their data sources and building out the data catalog. So typically that's where they are focused. Data engineers build out the data catalog. They then provision what we call data domains, which is basically a subset of the data catalog that is then provisioned for a certain set of users for them to, like, build transformation pipelines and so on. The second set of users are data analysts and data scientists. So the data scientists and data analysts, in their day, are working with a partial set of the data catalog and then building transformation pipelines on top. They're shaping the data and creating the data models within what we are calling data domains. And when the final data models are, you know, ready for consumption, they're typically working with production admins. The production admins then take the pipelines and the artifacts that are created and migrate them from dev to production, right? The production systems are typically managed by production admins, and they are then orchestrating and, you know, deploying it in production, and they're also monitoring that in production. So Infoworks provides an orchestrator where they can, you know, monitor and see how things are going. They can also look at the SLAs once it is deployed in production and make certain changes if needed, whether by passing parameters or by increasing capacity. Like, you know, if a process is going very slow and it's taking a long time, they can add more compute to that process and get the SLAs to run within the thresholds that they have set. So those are the three different kinds of roles: the data engineers who are interested in building the data catalog, the data analysts and data scientists who are interested in building out data models that are shaped for business consumption, and the production admins who are orchestrating the workflows in production.
Tobias Macey
0:29:16
And throughout those different stages, can you talk through the overall lifecycle of a unit of data, whether that's a single record or a large batch of data, and how that flows through the Infoworks platform and the different subsystems that it touches on at each stage of its lifecycle?
Amar Arsikere
0:29:31
So the lifecycle of a unit of data in Infoworks: the first part of this is that when a data engineer sets up a connection to a data source, Infoworks is crawling and understanding the metadata of that data, so all the structure and the information and the data pattern is pulled in. And then the second part of it is to actually ingest the raw records themselves. So the raw data is ingested. The first load is the historical load, where you're bringing in all of the data sets for a certain table or artifact, schema, and so on. And then the second phase of the ingestion is the CDC, where you're bringing in the changing data and data sets. Now, if this data represented a new record, it will be handled in one way. If it was an updated record, then it goes and gets processed in our merge process. So we have a continuous merge process, which, you know, essentially takes the updated record and merges it into the base tables that are in the big data environment, and this data gets tagged. So there is a time axis on which this data gets tagged, and it may also be pushed into current tables or history tables based upon whether this was an old record versus whether this is the current record that is now represented on the source side. So that's all the things that happen to a unit of data when you're building out a data catalog; that's the first phase of this whole thing. The second part of this is the data transformation. So once the data is brought in and nicely organized on a time axis and cataloged, there is a transformation pipeline, which is essentially applied on top of your data. And that transformation pipeline has certain kinds of operations, like, you know, join nodes and, you know, unions and other kinds of transformations that you may be applying before that data shows up in a data model. There is a lot of management that happens in a transformation pipeline, like if you need to know who made what changes to the transformation logic, lineage, versioning, and all those things. So that's orthogonal to the flow of the data itself; we are keeping track of everything that happened to that transformation logic. And once the data shows up in the data model, you know, the final step is that it could then be exported into a consumption layer, where it could be some cloud endpoint or some other sort of SQL engine or things like that. And then, as we discussed earlier, the final part of this whole thing is orchestrating it in production. So this is one part of, like, you know, the data flow, but once you have this operationalized in production, you're running these things every hour or every 10 minutes or whatever frequency that you need.
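A rough sketch of the continuous-merge idea described above, where incoming CDC rows update a "current" table and superseded versions are kept on a time axis in a "history" table. The table and column names are hypothetical, and this shows only one way to express the pattern, not Infoworks' internal code.

```python
# Illustrative CDC merge on a time axis: incoming changes upsert the current
# table while the versions they replace are appended to a history table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cdc-merge-sketch").getOrCreate()

changes = spark.read.table("staging.orders_increment")   # latest CDC batch
current = spark.read.table("warehouse.orders_current")

# Rows in "current" that are being replaced by an incoming update.
superseded = (current.join(changes.select("order_id"), "order_id", "inner")
              .withColumn("valid_to", F.current_timestamp()))

# Append the old versions to the history table, preserving the time axis.
superseded.write.mode("append").saveAsTable("warehouse.orders_history")

# New current view: untouched rows plus the incoming inserts and updates.
untouched = current.join(changes.select("order_id"), "order_id", "left_anti")
new_current = untouched.unionByName(
    changes.withColumn("valid_from", F.current_timestamp()),
    allowMissingColumns=True,
)
new_current.write.mode("overwrite").saveAsTable("warehouse.orders_current_next")
# A real pipeline would then swap orders_current_next into place atomically
# (or use a table format with native MERGE support) rather than overwrite a
# table that is still being read in the same job.
```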
Tobias Macey
0:32:21
The interface that is available for the data engineers and the analysts to be able to build out these different data flows and interact with their data is largely UI oriented, with a low-code approach and being able to click and drag different components. I'm wondering what you have found to be some of the design challenges that you face in being able to provide an appropriate level of control and expressivity while still being able to make it accessible to people who don't have either the technical capacity or the time to be able to dig deep into the code level aspects of the systems.
Amar Arsikere
0:32:57
So the interesting thing is, you know, our UI subsystem is built upon our API layer. So you're right, all three of the different personas or roles using Infoworks are using this collaborative UI platform and managing their work, whether it is to build out the data catalog, or transformations, or operationalizing the data. And in some cases, some of our customers are also using this platform without a UI, using an API to drive the pipelines in production. So that's another sort of usage category as well. One of the things that we have found is that there is a challenge in terms of making it completely UI based, because in some cases you need to have programmatic access to various points within the platform. And we have an ability where, you know, since our entire platform has got APIs, you can tap into different parts of our system and do an integration. One example is if you have custom transformation code and you want to be able to use that, you can drag and drop and, you know, put in custom code inside of Infoworks and make it into a node that becomes a reusable node, and you can, you know, pass parameters to it and so on. So this is one approach to reusing some of the things that you may already have. We also support integration using our orchestrator, which you can call a loosely coupled integration, where, you know, as part of your transformation, you can pass control to your system, a pre-existing system, pass parameters, and so on, and then, you know, once that has completed, have it come back into Infoworks and continue on. So this way, you know, you have a mechanism for you to integrate into existing systems.
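A minimal sketch of the "bring your own code as a reusable node" idea mentioned above: a custom transformation is just a dataframe-in, dataframe-out function that accepts parameters, so data frames can be passed between nodes in memory. The register_node decorator and its registry are hypothetical illustrations, not Infoworks' actual API.

```python
# Sketch of a reusable custom transformation node. The registry and decorator
# are hypothetical stand-ins for whatever mechanism a pipeline engine uses to
# discover nodes; the transformation body is ordinary PySpark.
from pyspark.sql import DataFrame, functions as F

NODE_REGISTRY = {}

def register_node(name: str):
    """Record a dataframe-in/dataframe-out function as a reusable node."""
    def wrap(fn):
        NODE_REGISTRY[name] = fn
        return fn
    return wrap

@register_node("mask_pii")
def mask_pii(df: DataFrame, columns: list[str]) -> DataFrame:
    # Example custom logic: hash the configured columns before reuse.
    for column in columns:
        df = df.withColumn(column, F.sha2(F.col(column).cast("string"), 256))
    return df

# A pipeline engine could then invoke the node by name with parameters:
# masked = NODE_REGISTRY["mask_pii"](customers_df, columns=["email", "phone"])
```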
Tobias Macey
0:34:50
And then in terms of the workflow of the people who are defining those pipelines, what do you have available for being able to handle versioning of them and releasing them to different environments to validate them and test them or for being able to look back at past versions and do historical analysis.
Amar Arsikere
0:35:07
There is a built-in version control for the pipelines. So you can have data analysts work on a new version while there is an existing version that's being run in production, and once the development is complete, you can migrate that new version into production. So you have the ability to, you know, tag and version all those things. We also support integrating into your existing config management tools, such as GitHub. So those are things that are available. And one of the unique things is, since this is a platform that's API driven, you can also build CI/CD on top of it and perform things like data validation and data reconciliation. We support things like record-count based data reconciliation out of the box. So if you need to, you know, run a reconciliation job at the end of every week, you just need to click a button and those reconciliation processes will run. And you can also perform custom data validation; you can specify slices of data. So things like, you know, if you want to take a slice of your data, you can say, show me the number of sales that happened in this zip code, and that should be a slice of your data that you always want to validate at the end of a certain day. You can specify those slices of data, right, and then it generates the reports and statistics to make sure that everything is running fine. And that way you can watch the pipeline and validate it and manage this on an ongoing basis.

Tobias Macey

And then the graphical paradigm can sometimes break down because of the fact that there are elements that are difficult to translate into a UI paradigm, or that require some specific custom development to be able to handle. And I'm wondering what you have seen as being some of those edge cases where there's the necessity to be able to drop down to the code level, and what you have as far as an escape hatch in your platform.

Amar Arsikere

You know, in cases where the customers have to perform a custom transformation, and they either have an existing code base they want to use or, you know, they want to write code, in some of those cases what we support is the ability to add a custom transformation node. So we have a mechanism in which we can support adding custom transformation code, whether it's in Java or Scala or Python, and you can code that and then drop it in a certain format into Infoworks, and it becomes a reusable, drag-and-droppable node. You can attach that within the Infoworks platform, you can drag and drop it, build it out, and reuse it in various ways that you may want to use it, and that way you're leveraging your existing assets. So it's not something where you have to recode everything in this new environment. And the other thing that we also do is that we support integrating, you know, Java based transformation nodes with Scala or Python based transformation nodes without having to go back to disk. In other words, we are doing in-memory transformations, you can say, passing data frames between these nodes. So there is no penalty, you know, for you to write it in different languages.
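A minimal sketch of the record-count reconciliation and slice validation described a little earlier in that answer, assuming hypothetical table names and a sales-per-zip-code slice; it illustrates the general checks, not Infoworks' built-in implementation.

```python
# Illustrative validations: a record-count reconciliation between source and
# target, and a "slice" check such as total sales per zip code. Table and
# column names are hypothetical examples.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reconciliation-sketch").getOrCreate()

source = spark.read.table("source.sales")
target = spark.read.table("warehouse.sales_current")

# 1. Record-count reconciliation.
source_count, target_count = source.count(), target.count()
assert source_count == target_count, (
    f"Row count mismatch: source={source_count}, target={target_count}")

# 2. Slice validation: compare per-zip-code sales totals between the systems.
def sales_by_zip(df):
    return df.groupBy("zip_code").agg(F.sum("amount").alias("total"))

mismatches = (sales_by_zip(source).alias("s")
              .join(sales_by_zip(target).alias("t"), "zip_code", "full_outer")
              .where(F.coalesce(F.col("s.total"), F.lit(0))
                     != F.coalesce(F.col("t.total"), F.lit(0))))

print(f"{mismatches.count()} zip-code slices disagree between source and target")
```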
Tobias Macey
0:38:19
And what are the cases where Infoworks is the wrong choice of platform for handling data orchestration and integration?
Amar Arsikere
0:38:29
Yeah, well, this is a great question. So one of the things is, Infoworks is a platform for, you know, building out data operations and data orchestration for batch style, you know, use cases and micro-batch style use cases. So if you're looking to do real time analytics with millisecond sort of response times, then this is not the right choice of platform. You know, if you are looking to perform a use case which requires data models to be updated and made available in a few minutes, I would say two minutes and above, then this would be an ideal choice of technology.
Tobias Macey
0:39:03
And in your experience of building and growing Infoworks as a business and as a technical platform, what are some of the most challenging or interesting or unexpected lessons that you've learned in the process?
Amar Arsikere
0:39:14
Yeah, you know, one of the things is, as we have discussed, I mean, Infoworks is a highly automated system. And one of the things that we have learned is that automation by itself is also not sufficient. As a result, we have invested in, you know, customer success with our enterprise customers. And we have customers who have built thousands of pipelines and run 400-plus use cases in production in 12 months, so they've been able to gain the agility that they did not have before Infoworks. Automation was definitely the reason why that agility, you know, came into play, but we have also invested in training the customer and advising them in this new paradigm of, you know, doing data operations and orchestration. And I think that was, I would say, a lesson learned in terms of, you know, how automation plays out; you still need that education, that advisory sort of role, and we have heavily invested in that. We're also investing in a lot of self service capabilities, where, like, you know, there are tutorials, video tutorials, and other things that we are investing in, and also recommendations. So when data analysts and data scientists are performing certain operations, we are essentially analyzing what they are doing and recommending, in the app itself or in the platform itself, to guide them. So these are things which we have learned as a result of working with a large number of customers, and all those heuristics, the rules of, like, you know, how people are using our system, we are applying machine intelligence to, you know, recommend what they should be doing. And I think those things are having, I'd say, a big impact on usage as well.
Tobias Macey
0:41:02
What are some of the most notable ways that the overall big data landscape has evolved since you first began working on Infoworks, and what are some of the industry trends that you're most excited for?
Amar Arsikere
0:41:12
Yeah, since I started Infoworks, I think one thing we have seen is that there is a secular movement towards more efficient technologies when it comes to, you know, distributed execution engines for data processing. What I mean by that is, you know, we had this open source Hadoop, which started with Apache Hadoop, and then after that the Spark technologies, which came in and, you know, essentially made MapReduce sort of work in memory. And now we are seeing ephemeral clusters, which means you don't need to have, you know, one big, large, static cluster; you can have a cloud cluster start up and be taken down for each sort of workflow. So all of these things are essentially solving two things. One is the complexity of managing a cluster, and the second thing is the cost, or the efficiency, of data processing. You know, being in this world of data for a while now, I think one thing which has become clear is that there is more data today than all of the data computation ability we have as human beings, right? So, as I said, there is a need for much more efficient data processing technologies, and that's a movement that's going to keep happening. And what we have done at Infoworks is to really build this data operations and orchestration platform with an infrastructure abstraction, so we can run on any kind of execution engine and storage technology. And that's one of the things that we see that's very exciting: you're going to see many more technologies for data processing.
Tobias Macey
0:42:47
And are there any new features that you have planned for the near to medium term or overall improvements or enhancements to the platform that you'd like to share before we close out the show?
Amar Arsikere
0:42:57
Yeah, I think one of the things that we are working on in our roadmap is the self service initiative, where there is going to be, you know, sort of a guided tutorial as you're using the product, and recommendations for various things you can do. We have learned a lot from many large customers using it, and I think one thing we have, you know, seen is that our customers are running thousands of pipelines in production, and we have a lot of heuristics and a lot of things that we have learned from this. And we are using machine intelligence to surface that as people are building their pipelines and using the product.
Tobias Macey
0:43:32
Well, for anybody who wants to follow along with the work that you're doing or get in touch or get some more information about the platform that you've built out, I'll have you add your preferred contact information to the show notes. And so, with that, I'd like to ask you a final question: what do you see as being the biggest gap in the tooling or technology that's available for data management today?
Amar Arsikere
0:43:51
Yeah, one of the biggest gaps that I see in the tooling for data management today is that data operations are fragmented, because there are a lot of tools, and you have to write a lot of glue code to stitch them together, and doing automation of data operations in such a setting becomes very challenging. That's one of the biggest gaps that I see. And, you know, there was one research paper from Google which said that in a typical sort of ML use case, only about 5% tends to be ML code and 95% is glue code, right? And I think having a platform that integrates all of the technologies, or all of the functionality that is required, from onboarding a data source, to transforming it, to operationalizing it, becomes very important.
Tobias Macey
0:44:36
All right, well, thank you very much for taking the time today to join me and share your experience of building out the Infoworks platform. It's definitely a very interesting system, and one that solves a very complicated challenge for larger organizations, and really for any organization. So I definitely appreciate the work that you're doing there, and I thank you again for taking the time, and I hope you have a good rest of your day.
Amar Arsikere
0:45:00
Thank you for having me, Tobias.
Tobias Macey
0:45:07
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.