Summary
There is a wealth of options for managing structured and textual data, but unstructured binary data assets are not as well supported across the ecosystem. As organizations start to adopt cloud technologies, they need a way to manage the distribution, discovery, and collaboration of data across their operating environments. To help solve this complicated challenge, Krishna Subramanian and her co-founders at Komprise built a system that allows you to use and secure your data wherever it lives, and to track copies across environments without requiring manual intervention. In this episode she explains the difficulties that everyone faces as they scale beyond a single operating environment, and how the Komprise platform reduces the burden of managing large and heterogeneous collections of unstructured files.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Your host is Tobias Macey and today I’m interviewing Krishna Subramanian about her work at Komprise to generate value from unstructured file and object data across storage formats and locations
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Komprise is and the story behind it?
- Who are the target customers of the Komprise platform?
- What are the core use cases that you are focused on supporting?
- How would you characterize the common approaches to managing file storage solutions for hybrid cloud environments?
- What are some of the shortcomings of the enterprise storage providers’ methods for managing storage tiers when trying to use that data for analytical workloads?
- Given the growth in popularity and capabilities of cloud solutions, how have you approached the strategic positioning of your product to capitalize on the market?
- Can you describe how the Komprise platform is architected?
- What are some of the most complex considerations that you have had to engineer for when dealing with enterprise data distribution in hybrid cloud environments?
- What are the data replication and consistency guarantees that you are able to offer while spanning across on-premise and cloud systems/block and object storage? (e.g. eventual consistency vs. read-after-write, low latency replication on data changes vs. scheduled syncing, etc.)
- How do you determine and validate the heuristics that you use for understanding how/when to distribute files across storage systems?
- How does the specific workload that you are powering influence the specific operations/capabilities that your customers take advantage of?
- What are the most interesting, innovative, or unexpected ways that you have seen Komprise used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Komprise?
- When is Komprise the wrong choice?
- What do you have planned for the future of Komprise?
Contact Info
- @cloudKrishna on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Today's episode is sponsored by Prophecy.io, the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all the data users can use software engineering best practices: Git, tests, and continuous deployment with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control. Then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage.
Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. Your host is Tobias Macey, and today I'm interviewing Krishna Subramanian about her work at Komprise to generate value from unstructured file and object data across storage formats and locations. So, Krishna, can you start by introducing yourself?
[00:02:04] Unknown:
Yeah. Thanks, Tobias. Yeah. I'm a cofounder and COO of Komprise. We are a data management company headquartered in Silicon Valley,
[00:02:13] Unknown:
and I'm one of three cofounders of the company. And do you remember how you first got involved in working with data?
[00:02:19] Unknown:
Yeah. We've been working with data for nearly 30 years now. My two cofounders and I have backgrounds in distributed computing. A lot of the problems of how to manage data can be solved with good distributed scale-out architectures. So we did two companies prior to Komprise, both of which were acquired. And then we started Komprise based on feedback from our prior customers.
[00:02:44] Unknown:
So in terms of what you're building at Komprise, can you give a bit of an overview about the problems that you're trying to tackle and some of the story behind how the company came to be and why you decided that this was the problem space that you wanted to spend your time and energy on?
[00:02:58] Unknown:
Yeah. You know, it's very interesting. We all intuitively know that data growth is exploding. You can see it in your own life. I mean, we're recording this video now. There's video and audio that we are generating. You're probably generating a lot of audio with all your podcasts. You're taking a lot of pictures on your phone. You're probably going to the doctor, where they're doing an X-ray or an MRI. Or you get in a car and you're driving it. If it's an electric car, you're generating IoT data. So you can see in your own life how much data each of us is generating. So probably everybody knows we're generating a lot more data today than ever before.
What we may not realize is that 90% of the data in the world today is what we would call unstructured data, meaning that it's not data in a database. It's not data organized neatly in rows and columns. It's things like audio files, video files, genomics files, imaging files, you know, all these kinds of data. And this is not how it used to be. You know, before 2010 or so, most of the data in the world was actually structured. A big shift happened around the early 2010s. And from then on, if you look at the growth curves that IDC and others have, all the explosion has been in unstructured data.
And we started Komprise because many of our customers from our prior two companies came to us and said they were caught by surprise with this tremendous growth of unstructured data, because you cannot manage it the same way that you manage a database. And so vendors were not addressing that gap of how to manage unstructured data. And we realized that managing unstructured data is not a storage problem. You actually need a layer above storage, and you need a distributed layer that works across different clouds and across different storage. That fits well with our background. That's why we started Komprise.
[00:05:03] Unknown:
And to your point, there are a few different axes around the management of this unstructured data, where one layer is the actual storage location of it, where it might be block storage on spinning disks, or it might be object storage in something like S3. Then there's also just the metadata about what are the file names, what are the attributes of the files, where are they located, and things like that. And then there's also potentially more rich data that you can extract from those unstructured files to understand, okay, this is a PDF, and it has some graphics that I might be able to incorporate into some other product. So there are many different aspects of that. And I'm curious which aspect or aspects you're focused on addressing in the work that you're doing at Komprise?
[00:05:50] Unknown:
Yeah. As you said, definitely, you know, storage of data is important because you wanna optimize it for the kind of data that you're holding, and you can address price and performance. And there's been a lot of innovation in storage, you know, from the Cloud vendors and the different storage vendors around unstructured data, you know, particularly file storage and object storage. The piece we are addressing is the layer above that: how do you know what storage is right for this data at the right time? How do you know what data to pull out of a data lake and do analytics on? You know, how do you make it easy for users to find data no matter where it lives? Because data is scattered around so many silos. Right? So that is what we call data management.
So data management is providing a consistent, systematic way to search across all your data, to understand all your data, to right place data, and to execute functions on data. And those are things that go beyond any single storage silo because data is scattered in so many places. That's why a separate layer that can just give you a view of the data regardless of where it lives and mobilize that data, you know, is required. That's what we call data management. It's analytics and data movement and data extraction.
[00:07:16] Unknown:
In terms of that sort of discoverability and organization capacity, there's another company that I spoke with recently called Unstruk that's focused on building a sort of metadata lake for your unstructured data sources. And I'm curious if you're familiar with that company, and if so, how you would characterize
[00:07:35] Unknown:
the work that you're doing versus the way that they're approaching the problem. I am not familiar with those guys in particular, but I will tell you that there are data lakes. Like, you know, Azure itself has a data lake. Amazon has a data lake. I think the problem that a lot of companies run into is that, you know, if the data is not optimized in some way, and unstructured data doesn't have any specific format to it, it has all kinds of different formats, then how do you know what to find? How do you know even where to look? So a lot of what Komprise does is you just point Komprise at any of your cloud accounts and any of your data centers.
Komprise finds all the data in all these places. It indexes all the data for you. It gives you analytics on all the data. And we didn't have to move it into our own metadata lake. We didn't do anything like that. Your data is wherever it is, but it's giving you a view of everything. And then you can search. And you could say, oh, I see that this data hasn't been touched in over a year, and it's consuming Flash storage; that's a mismatch. Let me put that in the Cloud. Or I know that for my legal hold, I need to keep all the data from this user related to this project in an object-locked bucket for 5 years. Let me have Komprise move it there and lock it for 5 years. After 5 years, I can delete it. So we enable discoverability of data wherever it lives. And then systematically, by policy, we move the data.
And then we also enrich the data with tags. So basically, we're providing kind of an index, if you will, an actionable index across different repositories.
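To make the policy-driven approach concrete, here is a minimal sketch of what such a tiering rule could look like, assuming a hypothetical policy object evaluated against a metadata index; Komprise's actual API is not described in this conversation, so every name below is illustrative.

```python
# Hypothetical sketch of a policy-driven tiering rule like the one described:
# "untouched for over a year and sitting on flash, so move it to the cloud".
from dataclasses import dataclass, field

@dataclass
class TieringPolicy:
    name: str
    min_days_untouched: int   # how cold the data must be before it moves
    source_tier: str          # where it lives today, e.g. "flash-nas"
    target_tier: str          # where it should go, e.g. "s3://archive-bucket"
    tags: dict = field(default_factory=dict)  # enrichment applied on move

    def matches(self, file_meta: dict) -> bool:
        # A scanner would run this check against every record in the index
        return (file_meta["tier"] == self.source_tier
                and file_meta["days_untouched"] >= self.min_days_untouched)

cold_to_cloud = TieringPolicy(
    name="cold-flash-to-object",
    min_days_untouched=365,
    source_tier="flash-nas",
    target_tier="s3://archive-bucket",
    tags={"lifecycle": "archived"},
)

print(cold_to_cloud.matches({"tier": "flash-nas", "days_untouched": 400}))  # True
```

Files that match would then be queued for a transparent move, with the policy's tags applied as part of the transfer.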
[00:09:22] Unknown:
Yeah. The note of being able to manage the lifecycle of the data and being able to apply a policy to say this source of information has this compliance regime that we need to follow. So we need to make sure that we maintain it for X number of years before we can actually delete it, and we don't have to worry about, you know, lack of institutional memory or fat-fingering a particular command and accidentally deleting data that we have to maintain for legal purposes.
[00:09:49] Unknown:
Yes. Exactly. Yeah. We believe in something called actionable analytics. You know, there's probably different ways you might be able to get analytics on your data or information about your data. But, you know, in the space of unstructured data, a petabyte of data is probably a couple of billion files. And if you have tens to hundreds of petabytes of data, you're dealing with, like, hundreds of billions of files. There's absolutely no way anybody giving you just analytics will help you much, because for you to take some action on it, it's cumbersome. That's why Komprise does analytics, indexing, and mobility in a single solution.
To your point, you could just set a policy and say, hey, this data is for legal hold, I'm putting it here for 5 years. And you just set the policy once. Komprise will do it all for you. You don't have to worry that people didn't remember or something happened. Komprise will put it in that bucket. 5 years later, it will pull it out of there, put it into your trash bucket, and ask you, hey, are you ready to delete it? You know, so all of that will automatically happen.
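For the object-lock piece of that legal-hold workflow, the underlying cloud mechanism is real and well documented: S3 Object Lock in compliance mode prevents deletion or overwrite until a retention date passes. A minimal sketch with boto3, using hypothetical bucket and key names (and assuming the bucket was created with Object Lock enabled):

```python
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")

# Hold the object for roughly five years, as in the example above
retain_until = datetime.now(timezone.utc) + timedelta(days=5 * 365)

s3.put_object(
    Bucket="legal-hold-bucket",         # hypothetical bucket with Object Lock enabled
    Key="project-x/study-results.pdf",  # hypothetical key
    Body=b"(contents of the held file)",
    ObjectLockMode="COMPLIANCE",        # cannot be deleted or overwritten by anyone
    ObjectLockRetainUntilDate=retain_until,
)
```

Once the retention date passes, an ordinary delete succeeds, which is the point where a policy engine would move the object to a trash bucket and ask for confirmation.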
[00:10:57] Unknown:
In terms of the target users and use cases that you're focused on, I'm curious how you think about the personas of either the individual end users and roles that they fulfill or the sort of categories of organization that would benefit most from the product that you're building?
[00:11:16] Unknown:
Yeah. It's a great question. So we address two sets of use cases. One use case is more on the infrastructure side. So typically, we go in to somebody who owns the infrastructure budget for data. And so it might be like a VP of IT infrastructure, VP of cloud infrastructure, or VP of global storage. You know, depending on the company, those role titles might be different. But, basically, they're looking at the cost of their storage and data protection and data management. They're looking at cloud transformation. They need a way to cut 60 to 70 percent of cost while handling data growth. And then they need a way to make themselves more agile and deliver data as a service.
And so that's who we go into and they're usually our champions in an organization. And what they do is they see the value of this index and they bring in their departmental IT. They bring in their legal IT. You know, they bring in these other teams that are able to use the data for legal hold or for, you know, feeding big data and machine learning analytics or for doing things like, you know, deleting obsolete data. That's another problem a lot of these companies have. Obsolete data can be a liability in a company. So identifying it proactively is important. So the short answer is infrastructure IT or cloud IT is usually the starting point for us.
[00:12:45] Unknown:
As far as the ways that these sort of enterprise IT and, you know, data management teams within these organizations are handling the problem of file storage, file indexing, you know, organization, categorization, some of these access controls and life cycle policies. What are some of the general approaches that they have typically been using to tackle this range of problems related to this storage of, you know, unstructured data and file objects and data assets?
[00:13:20] Unknown:
Yeah. That's a great question, Tobias, because, you know, when unstructured data's footprint was pretty small, you know, basically, you could just rely on your storage vendor to handle it for you. If you have very little data and you just have one storage vendor, one architecture in one place, you kind of work out a good price with them, and the problem is just not big enough to go and worry about anything else. And that has significantly changed today because data outlives storage. In most enterprises, data has a minimum lifespan of 25 to 30 years, sometimes much longer.
And most of your storage purchases or even Cloud contracts, you're looking at a 3 to 5 year time frame. So your data is going to go through several iterations of storage, several iterations of backup. Technology is gonna change significantly in 30 years. So do you really wanna be locked into all these silos of data management? And more importantly, do you want to have low visibility into your data across all of this? Because increasingly, it's not just about storing data. It's about maximizing the use of that data over its 30-year lifespan. The requirements have changed significantly.
The problem has gotten more complex. There are way more options available to customers. And they are being asked to deliver a data service, not a storage service. So that's why, you know, data management needs to be independent from storage.
[00:14:56] Unknown:
As to the shortcomings that exist when you are relying on the storage layer to be the kind of gatekeeper for these use cases and these requirements, what are some of the issues that come up in terms of, you know, wasted effort, or issues with failing to comply with your regulatory requirements, or just issues with losing data because you don't know where it is or maybe it gets deleted accidentally? Just some of the issues that arise because of the fact that you don't have this more holistic approach to the data storage and management.
[00:15:32] Unknown:
There's several things that you kind of highlighted there. The biggest, I think, is that there's a lot of missed savings opportunities, and there's a lack of flexibility. You know, just imagine, like, you're recording from the studio, and you might go to the kitchen to get a meal at some point, and you're free to move around your home. Right? And you're free to go anywhere, and you might leave the house at some point. And that's because you're autonomous and you're not controlled by the house that you're living in. For data, it's not like that. Right? Data is put in storage, and the storage is trying to move you around. And, you know, it's trying to tell you that you can't leave. If you leave, you know, there's a cost. You know, all the data has to be rehydrated.
And so that's not ideal. So from a customer perspective, customers lose about 60 to 70% of the savings that they could get if they had a data management solution versus a storage-centric one, because there are way more options available to them. So as a simple example, if you put data in a cloud file storage, the file storage may only have a limited number of tiers, but the Cloud may have 30 or 40 tiers available to you. And some of those tiers are, you know, orders of magnitude less expensive than the tiers in the file storage. So data management will move your data to those cheaper tiers, but the file storage solutions won't. They'll only keep you inside the file storage tiers, because they don't make money if they let you move out of their environment.
So there's a lock-in: basically, you get locked into higher costs and you get locked out of the native capabilities of other platforms.
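The cloud-native tiers she contrasts with file-storage tiers can be seen directly in AWS's lifecycle rules. As a point of reference, this is what a native transition between S3 storage classes looks like with boto3; the bucket name and prefix are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Move objects to progressively cheaper S3 storage classes as they age
s3.put_bucket_lifecycle_configuration(
    Bucket="archive-bucket",  # hypothetical
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-cold-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "projects/"},
            "Transitions": [
                {"Days": 365, "StorageClass": "GLACIER_IR"},     # ~1 year cold
                {"Days": 1095, "StorageClass": "DEEP_ARCHIVE"},  # ~3 years cold
            ],
        }]
    },
)
```

A storage-centric file system can only shuffle data among its own tiers; a rule like this operates on the cloud's full range of storage classes, which is the gap a data management layer exploits.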
[00:17:15] Unknown:
And as to this cloud migration aspect, as you mentioned, a lot of these enterprise organizations and traditional IT are going to be dealing with storage vendors, or maybe they've got, you know, a large SAN array or a NAS that they're dealing with, or they're dealing with a flash array storage system. And as they're starting to move to the cloud because different business units have different operational requirements, they wanna be able to take advantage of the elasticity, or they need to be able to iterate quickly, and they don't wanna have to deal with some of the acquisition times required to, you know, rack and stack new servers. What are some of the complexities that come about when they are trying to span across these different operating environments, where they have these on-premise data centers and the in-house knowledge and capacity to manage some of those storage solutions, but they also need to be able to manage a unified interface or unified access layer and governance of this data as it moves into these, you know, private cloud or public cloud environments? How does that manifest as far as the scope of responsibility, as to who owns that problem and how they might approach the sort of collaboration and enforcement of those different requirements?
[00:18:33] Unknown:
It's a very interesting point because, you know, typically, IT is a custodian of data. You know, IT is asked to store data, protect it, move it to the cloud, and do all those things, but they are not the users of the data. So department users, line-of-business users are the users of the data. They are the ones that know what should actually happen with the data, what data is actually important to them, what data, you know, they want to use where. Right? And today, you know, it is very imperfect, because when you manage data through silos and you lack visibility into data, all decisions are ad hoc. You know, somebody comes and tells you, hey, I need the best storage for my data because I'm gonna run this massive job. You provision, like, expensive Flash storage in the cloud or whatever for them. And then, you know, they're done with that analysis in a month, but they never tell you, because, you know, that's not their job. They moved on to something else, and that data is still consuming all those expensive resources. You know, so for IT, the challenge, I think, is how do they provide an environment that's flexible enough that different departments and different users can have different policies for their data based on their needs?
And yet, how can IT have central visibility across Cloud and data center into what they're doing and how data is being managed, so they can enforce things at a central level and govern at a central level but not get in the way of users using the data? Because ultimately, it's all about improving the productivity of the users. And again, that's where data management is important, because a system like ours, for example, we basically move data without getting in the way. We never get in the path of the data. Users think the data is still there. They can use it from wherever.
All their applications continue to work, whether it's in the cloud or not, because we manage kind of providing that transparency. We have a patented way of doing that across file and object storage. And so we give IT the visibility into, hey, where is data? Who's using it? How much is it costing? Where do you want to put it? All those things IT can manage, but they are not getting in the way of users. Users have full access to their data, they can set, you know, rules on what they want, and they can search for data. That all is completely available to them. So it's a way for IT to collaborate with users rather than have this friction between IT needs and user needs, which are not always, you know, the same priorities.
[00:21:19] Unknown:
StreamSets' DataOps platform is the world's first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor, and manage data pipelines confidently with an end-to-end data integration platform that's built for constant change. Amp up your productivity with an easy-to-navigate interface and hundreds of prebuilt connectors, and get new pipelines to production and new hires up to speed quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift, those ongoing and unexpected changes in schema, semantics, and infrastructure.
Finally, one single pane of glass for operating and monitoring all of your data pipelines gives you the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners that subscribe to the StreamSets professional tier will receive 2 months free after their first month.

To the point of cost management and visibility, one of the other challenges that always comes up when you're trying to manage organization and cleanup of data is the issue of duplication, or understanding, you know, who created what when.
And particularly from the duplication and replication standpoint, I'm curious how you have approached that challenge of being able to say, okay, I have this file. It lives in this flash array on premise in this data center, but now I need to be able to access it from, you know, this cloud service to be able to do some, you know, machine learning algorithm on it. And so I need to be able to copy it over to this S3 storage location to be able to merge it with these other data assets. But then when I'm done with it, I wanna make sure it gets deleted. And just some of those complexities that come up when you do have to move data around multiple places, and managing some of the latency requirements around it, the cleanup afterwards to make sure that you only have one sort of canonical source of truth for a data asset. And in the event that that data asset changes and it's still being used by some other system, making sure that that information gets replicated to those different places.
[00:23:30] Unknown:
Yeah. No. That's great. It's a very important use case, actually. Exactly. So we allow you to, like, search and find data and have Komprise, you know, maybe copy that data to a location. We have something called deep analytics actions, and then manage the ongoing life cycle of that. Because to your point, otherwise there's that other copy, and it's just sitting there consuming resources forever, because nobody thinks about it after they finish their analysis. I mean, that's one thing: we as humans, we're naturally hoarders. Right? I'm a hoarder. I never delete anything. And, you know, that's our nature. And it's not our job all the time to ask users to go running around cleaning things up. That's a waste of their time.
That's where systematic processes should take over and should be able to handle it for you. So Komprise does that. It not only will make that copy of the data, but based on the policies you set, it will also do the cleanup of that. And if you did do some analytics in that other environment, we have ways to actually enrich the original data with tags. So all the output of that analytics doesn't get lost in the copy. It can be put back on the original source of truth. So not only do we maintain the consistency of the data, but also we're continuously enriching data as it goes through different processing. Yeah. And, you know, the other use case you didn't mention but is also very important is that most organizations by design keep multiple copies of data, whether for ransomware reasons or disaster recovery reasons or backup. You know, if suddenly a virus hits, you would go to an older version. But not all data needs 5 to 10 copies. Because if it's never been touched in a year or more, there's a cheaper way of getting that protection.
And so Komprise actually rightsizes the backup for you. So you can cut 80% of the backup footprint and cost by just doing a more passive way of leveraging durable solutions like object storage and object locking on cold data, and then using the high-end backup for hot data. So Komprise does all those things so that you have affordable backup and affordable ransomware protection. Particularly with ransomware, the costs are skyrocketing for companies.
[00:25:56] Unknown:
Yeah. Those are definitely useful and interesting problems. And most of the time when people are talking about big data and massive analytics, they're usually ignoring the question of backup, because, you know, if you have to have two copies of petabytes or exabytes worth of data, then that's a massive problem to solve. And so a lot of people just don't bother solving it. And so it's definitely interesting to think about some of the different ways that you can address that problem, where maybe you don't have to have two copies of the data everywhere. You just need to make sure that that one copy that you have is sufficiently protected so that it's never going to get deleted or modified.
[00:26:31] Unknown:
Exactly. Or you use, like, geo-dispersion within the storage. You know, you use a second zone within it. There are lower cost ways of getting protection. It's not the right answer for all data, but for passive data that hasn't changed in a long time, you know, you can right-size that protection.
[00:26:52] Unknown:
Before we start to dig into the technical aspects of what you're building, another sort of business-level consideration that I'm interested in is the way that you think about the strategic positioning of the Komprise product in relation to all the different players in the ecosystem, where you have the different storage vendors, you've got the cloud providers. And looking at your website, I noticed, you know, we have partnerships with, I think it was, Amazon, Microsoft, Google, you know, a whole bunch of different vendors. And so I'm curious how you think about that positioning of being this utility layer that is agnostic to the storage underpinnings and being able to play nicely across the whole ecosystem?
[00:27:33] Unknown:
You know, at a technical level, we are believers in being standards-based. Everything we do is standards-based, and the interfaces are all nonproprietary, meaning that we work with standard file protocols and object protocols. We put data in native form in all the storage. The reason, you know, all the cloud providers like to work with us is because when we move data to their Cloud, we put it in the native Cloud format. We don't lock it into a Komprise format. So you can directly go and use your data in the Cloud. You can directly run Redshift in Amazon. You can directly run Azure Analytics in your Azure bucket. You can do all of that without going through Komprise or without going through your file storage.
We believe that data should be in the control of the customer. It's their data, and they should be able to maximize the use of the data wherever it goes. You know, how do we partner with these different vendors? You know, we work very closely with the major Cloud and file storage providers and we actually expose a lot of their key capabilities in a simple way for customers. So as an example, you know, Amazon has currently almost 16 different tiers and classes of file and object storage. And we are the only ones who can leverage all of them. So we can put a file into Amazon FSx and when it's no longer being actively used, we can tier it transparently to Glacier Instant Retrieval, which is about 40 times cheaper than the Flash layer of FSx.
And then because we've indexed that data and we keep it in native S3 format, we can promote that data back up into an EC2 instance and run, you know, Redshift on it or run Snowflake on it. So we can move the data up and down in different directions consistently without lock-in. And so for the Cloud providers, we provide a simple way for the users to use the richness of the services they provide, because it's actually overwhelming how many services you can get from just one cloud. And if you manually write things to use each of those services, with the providers innovating at a breakneck speed, it's impossible for customers to keep up sometimes.
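The demote-and-promote movement she describes maps onto operations that exist natively in S3, since Glacier Instant Retrieval objects remain directly readable. A rough sketch of both directions with boto3, using hypothetical names; Komprise's transparent-move machinery is of course more involved than an in-place copy:

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "data-bucket", "datasets/scan-0001.dcm"  # hypothetical

# Demote: rewrite the object into Glacier Instant Retrieval
# (still readable in real time, unlike the archival Glacier tiers)
s3.copy_object(
    Bucket=bucket, Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    StorageClass="GLACIER_IR",
)

# Promote: bring it back to the standard tier before a compute-heavy job
# (note: objects over 5 GB would need a multipart copy instead)
s3.copy_object(
    Bucket=bucket, Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    StorageClass="STANDARD",
)
```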
[00:30:01] Unknown:
Digging into the technical layers, I'm wondering if you can talk through some of the ways that the Comprise platform is actually architected and the technical components that you're using to be able to provide the, you know, discovery, search, transportation, sort of all of these different aspects of being able to integrate at the storage layer and provide the interface for end users to be able to manage their data assets across these different locations and formats?
[00:30:28] Unknown:
So there are really 3 key things that I would say are the core innovations of Komprise, and these are patented. So the first is the ability to have this kind of distributed architecture. And when we say distributed, what we mean is Komprise works across different data centers, different sites, different clouds, different accounts, different buckets, different storage architectures. And it can work across all of these as a lightweight Cloud service. So there's really very little for a customer to set up to use Komprise. I mean, you just sign up and you can start using it. You just point it at your account and it starts working. It is not easy to make that happen from a technical perspective. So that lightweight distributed architecture is one of the core innovations.
The second thing is something that we call transparent move technology. And what we mean by that, I'll give you kind of a simple analogy. Let's say you're a shopkeeper and you have many things in your shop. I can come and ask you; I can say, Tobias, I want chewing gum, or, Tobias, I want candy. And whatever I ask you for, you will get it from wherever it is. Some things might be in your warehouse, some might be on the shelf. You will bring it to me. That's one way of providing data mobility. And it's a very intrusive way, because every time I need something, I need to come to you. I need to ask you. And you're the only one who can give me something.
The other approach is I allow self-serve. I have everything, and you could be in any store, and you can just search for it and get it and use it yourself, and I don't have to be in the way. You can use it directly. Right? Doing that, moving data so that I'm not in the path of the data access, is really difficult to do. It's very, very difficult. That's what Komprise does. Komprise provides this transparent move technology, where we can take a file, we can move it into an object, and it still looks like a file from the original place. It can still be used as a file from the original place, but it can be directly used as an object and can be directly manipulated without going through us. And so we are not a broker in the middle. We didn't create a new namespace you have to come to. We didn't create any bottlenecks for you. That is really hard to do, but it's extremely important to scale.
So that's the second thing. And the third thing we do is something we call the global file index. Basically, all these billions of files and objects that you have scattered everywhere, we have a central place to search. It's like a Google search for all your data, and it's very lightweight. You didn't have to do anything, and you get that search. Making that index lightweight, efficient, and actionable is technically not easy, and Komprise has that. So those are our 3 core innovations.
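As a way to picture the third innovation, here is a toy version of a global file index: a single searchable catalog of metadata records harvested from many stores. This is purely illustrative; the real index is patented and built for billions of records, not a Python list:

```python
from dataclasses import dataclass, field

@dataclass
class FileRecord:
    path: str            # file path or object key
    store: str           # e.g. "nas-01" or "s3://archive-bucket"
    size: int            # bytes
    days_untouched: int  # derived from last-access time
    tags: dict = field(default_factory=dict)

index: list[FileRecord] = []  # in reality, a distributed, persistent index

def search(**criteria):
    """Return every record whose attributes or tags match all criteria."""
    hits = []
    for rec in index:
        attrs = {"path": rec.path, "store": rec.store,
                 "size": rec.size, "days_untouched": rec.days_untouched,
                 **rec.tags}
        if all(attrs.get(k) == v for k, v in criteria.items()):
            hits.append(rec)
    return hits

index.append(FileRecord("genomics/run42.fastq", "nas-01",
                        7_500_000_000, 400, {"project": "sars-study"}))
print(search(project="sars-study"))  # one query surface spans every store
```

The point of the real system is that this one query surface spans every data center, cloud, and bucket, and that the results are actionable, feeding moves and tag updates rather than just reports.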
[00:33:23] Unknown:
Yeah. There are definitely a lot of interesting sort of detailed questions that I'd be happy to dig into. But at a more macro level, I'm curious what you have found to be some of the most complex challenges and considerations that you've had to engineer for and work around when you're dealing with the scale and variety of data and locations that you have to interface with?
[00:33:47] Unknown:
So the good news, I think, is that some of these standards are getting widely adopted, because if that weren't the case, this problem would be a lot harder to solve. In fact, that's why it hadn't been solved before, because there were no standards, you know, until about 7, 8, 10 years ago. You know, that's when standard SMB 2.0 came out and NFS was accepted. And, you know, S3 has become a de facto standard for object. All of these things happened in the last decade or so. So what is difficult about what we do? I think, you know, what is most challenging is that the system has to be really simple for someone to use, but it has to be performant in a non-intrusive way. And what I mean by that is we never want anybody to notice that we're even there. We don't want your storage to be any slower because Komprise is doing some analytics or moving some data around. Right? So, you know, if a network goes down, Komprise doesn't throw up an error. It knows to retry.
If some system is unavailable, Komprise knows; it's a fault-tolerant architecture. To create a distributed, fault-tolerant architecture that's non-intrusive is not easy to do. And that's where our background comes in. I mean, that's our background: distributed, scaled-out, fault-tolerant computing.
[00:35:11] Unknown:
In terms of some of the specific details that come up, one of the things that comes to mind is you have this file object. You know, maybe it's an Excel document or a Word document, and it lives in your corporate SAN. And so you then decide that you want to pull that into a Spark job to be able to pull out the data that's in this Excel file, process it, and then maybe append a new row based on the output of that. And so then you'd want that to be reflected back to the original source location, and just being able to identify when there are changes to files and when they need to be written back to the sort of canonical reference location, and just some of the signaling that goes into being able to register and process those event hooks across the life cycle of these data objects?
[00:36:03] Unknown:
So, yeah, that's exactly correct, Tobias. When we move files, even if you change, like, a permission on a file at one end, you know, we maintain all the access controls. So the same permission has to be reflected, you know, where the file was moved to, for example. And then, you know, tags: no matter where you go, you know, like, Amazon has a slightly different way of doing tags than Azure. Some systems have no tagging at all, like file storage. But how do we keep all that enriched metadata consistent no matter where the data goes? These are all the problems that we solve at the data management layer.
[00:36:38] Unknown:
In terms of the guarantees that you provide when somebody does say, I want to copy this file from the SAN to S3 and then from S3 to, you know, cold storage in AWS Glacier, or I wanna replicate it from Amazon S3 to Google object storage. I'm wondering what types of latencies people are expecting when they make these operations and some of the ways that you're able to manage the user experience when you do say, okay, I need to replicate this data, particularly if it's maybe a sizable file. Then also some of the questions about eventual consistency across these different storage locations versus read-after-write consistency, and just the scheduled syncing, just some of the management of user expectations and user experience as they perform all these, you know, potentially expensive and slow operations?
[00:37:32] Unknown:
So the first thing is that a lot of the operations we perform are kind of in the background. So for example, if you take a share or volume, if you take, like, an entire volume of files and you say, I'm gonna copy this volume into Amazon, I'm going to tier it into Glacier IR, you set the policy in Komprise. Komprise gives you updates on what it's doing, but you're still using the data. You're using it fully until you're ready for a cutover. And that few seconds to do the cutover is the only time when you actually see a shift. You don't really see it; the DNS shifts, and then you use the data from the next place. So from the user perspective, most of what we do is somewhat invisible.
They don't actually see any of these things. For the IT administrator who's scheduling these things, we have an elastic architecture. So basically, sometimes you only have so much network availability. You don't want to use all of it up for this job, and it's okay for it to run in the background and take longer. Sometimes you wanna do it much faster. So Komprise has elastic parallelism. So you can actually toggle that up or down and say, hey, I want you to actually distribute this job more because I want you to run it faster. Or, I actually need to go slower here; you know, run this in the evenings or weekends when nobody is using the network. So Komprise actually can do all that automatically. It kind of adaptively throttles
[00:38:59] Unknown:
based on the policies you set. And another thing that you mentioned is that you have the ability to help end users understand which assets are worth inspecting when they maybe want to do some analytical task, and being able to say, you know, these are the criteria that I want to be able to analyze, so then these are the specific files that might contain that information. And I'm just curious if you can talk to some of the concrete information that you're able to point to and some of the types of heuristics that you need to fall back on, and ways that you validate those heuristics and evolve them as people increase their usage of Komprise and expand upon their sophistication?
[00:39:43] Unknown:
At first blush, you know, we index everything automatically based on all the available metadata, because that's already there, you know, in different file systems and objects. Right? So for example, let's say you wanna find all the genomics files that you have across all your data centers that belong to a particular project. Maybe it was a SARS project, and it was created, you know, in this time range, regardless of which user created it. Set that query in Komprise, and Komprise will give you all the files. Some of them might come from your data center in Asia-Pac. Some of it might be in a cloud somewhere. Some of it might be from a data center here, and some of it might be from different departments. But it would show you that entire set of data, which itself is not easy if you had to manually go and try to find that information. Right? But now maybe you want to take that list and you want to look for anything that had studies done in the Netherlands.
And so now you can feed it into an indexer which actually looks for Netherlands inside those, you know, files and then tags all of them. And so now you can run the search again and say, hey, now let me see all the ones tagged as, you know, Netherlands studies for SARS. And now, okay, I want to take that dataset and I want Komprise to copy it maybe into my Databricks, because, you know, in Delta Lake I have a job that's gonna actually run, you know, Apache Spark or something. And it's gonna run some microscopic analysis on it, and it's gonna give me outputs of, you know, which of those studies actually, you know, showed a mutation or whatever. So then you can enrich the data again. So Komprise kind of gives you a systematic way to search and discover, mobilize the data, enrich data, bring it back, and do this cycle over and over again. So we're not always the ones doing the in-depth analysis of the data, because you have so many different indexers out there, so many different AI engines, cognitive engines. But we provide a consistent way
[00:41:51] Unknown:
to execute all of those. Does that make sense? Yes. And to the point where you are expanding on your example of collecting all of the genomics information related to these studies having to do with SARS-CoV-2 over this time range: we've determined that, you know, based on the analysis, the genomics information contained in these sets of files pertains to the Delta variant, these ones pertain to Omicron. And so now I want to be able to write back some of this metadata to live alongside those files. And to your point of the fact that you work with all of these different standards and open protocols, I'm curious, what are some of the limitations that folks run into when they maybe want that data to live natively inside the file object, and when they need to fall back to having that as an additional piece of metadata that lives inside Komprise and just needs to be shuttled around with that file object within the Komprise layer, and just some of the complexities that come up there.
[00:42:51] Unknown:
We try to make it really seamless, because, you know, different systems do have different capabilities. So file storage generally is good with regular metadata but not really extended metadata. Object storage is a lot more flexible in how much you can extend metadata, but it's not always consistent in how things are done. Every environment has different ways of implementing tags, for example, and they're not always really portable. And you don't have to worry about all those, because Komprise kinda handles it all for you, because we provide a consistent way to handle it across them.
The other thing that we can do is connect to other systems. So maybe you have a lot of metadata in your lab information management system, you know, because a lot of research environments, and genomics in particular, might have a LIMS where they've already put a lot of metadata into the LIMS system. The problem is that nobody outside the LIMS system can access that data. But in Komprise, we have, you know, ways where we can provide a unique identifier to the objects that we're keeping no matter where, and you can join that to the data in your LIMS system. So basically, we don't want islands of data. Our job is to bridge these islands.
No matter where the data or metadata lives, you can have a consistent way to look at it.
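The tag-portability problem she mentions is easy to see at the API level: S3 has first-class object tags, while a POSIX file system at best offers extended attributes. A small sketch of writing the same logical tags to both worlds, which is the kind of inconsistency a management layer has to paper over (the bucket, key, and path are hypothetical):

```python
import os

import boto3

def tag_s3_object(bucket: str, key: str, tags: dict) -> None:
    # Object storage: tags are a native, queryable concept
    boto3.client("s3").put_object_tagging(
        Bucket=bucket, Key=key,
        Tagging={"TagSet": [{"Key": k, "Value": v} for k, v in tags.items()]},
    )

def tag_local_file(path: str, tags: dict) -> None:
    # File storage: no tag API; extended attributes are the closest
    # analogue (Linux-only, and only if the filesystem supports xattrs)
    for k, v in tags.items():
        os.setxattr(path, f"user.{k}".encode(), v.encode())

tags = {"project": "sars-study", "region": "netherlands"}
# tag_s3_object("data-bucket", "genomics/run42.fastq", tags)
# tag_local_file("/mnt/nas/genomics/run42.fastq", tags)
```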
[00:44:14] Unknown:
In terms of the capabilities that Komprise offers, what are some of the use cases or specific workflows that are either often overlooked or not leveraged to their fullest extent that you think are worth calling out?
[00:44:30] Unknown:
Typically, customers bring us in first to just know what they have. You know, sadly, most enterprises are shooting in the dark when it comes to unstructured data. They don't know, like, why it's growing at the rate it's growing. They don't know what decisions are best for their data. So the first thing we provide is the analytics and planning ability, so you can plan what your next move should be. Then, you know, Cloud data tiering and Cloud data migration are typically very easy use cases to get started with. You know, as I mentioned on the data analytics side, things like, you know, legal hold, deletion of obsolete data, deletion of ex-employee data, you know, enabling research and departmental IT to do searches on data.
Those are some of the common follow on use cases that our customers use us for.
[00:45:22] Unknown:
In terms of the interesting or innovative or unexpected ways that you've seen Komprise used, I'm curious if there are any notable examples that come to mind.
[00:45:36] Unknown:
I think what I've been pleasantly surprised with is, you know, big organizations are reporting that they have saved 1, 000, 000 of dollars using Comprise. I mean, they've internally kind of looked at it. So, for example, I mean, this is a public case study we have with Pfizer. You know, actually, as they were developing the COVID vaccine, they used Comprise to manage the data across their data centers and AWS. And in the 1st year alone, they saved a few $1, 000, 000 in using the solution. And they're not alone in that. You know, I was talking to another pharmaceutical customer yesterday who said they saved $1, 000, 000 in the last 6 months using the product. And their CTO and everybody has visibility into it because of the sizable amount of savings we were able to generate.
And so that is gratifying because, I mean, that's what we expected. But to see that customers are able to recognize that value and they're able to see it, and also that their users are able to use the index. I think one thing that is gratifying to see is, in a lot of these environments, IT brings us in. But a lot of the pull to grow comes from the users, because they see the value of being able to search and find data and the fact that their data has been indexed for them. And so even though we didn't start with the users, they become really important proponents for us.
[00:47:00] Unknown:
In your experience of building and managing and growing the Komprise company and platform, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:47:12] Unknown:
Building a company is like raising children. I think every minute is unexpected, surprising, I would say. I mean, I'm very proud of our team, I would say. You know, I think the one thing we really learned from our last two companies is that it is very important, of course, to have the right solution at the right time in the market. I mean, those things go without saying. But I think what a lot of people underestimate is how important culture is in a startup organization and, you know, having like-minded people, not just that they're smart, but they're equally passionate and they get along and work well as a team, is extremely important.
And so we've been very conscious in how we've grown our organization. You know, it's a lot of fun. It's fun to build something together with a group of people who all want to make a difference. I would say that's been the most rewarding part of this whole journey.
[00:48:07] Unknown:
And so for people who are interested in being able to have this unified access layer and management layer for their file objects and unstructured data, what are the cases where Komprise is the wrong choice?
[00:48:21] Unknown:
We don't do anything with just block data. So if you only have block data in the organization, if all your data is inside a database, for example, we're not a good fit.
[00:48:33] Unknown:
And as you continue to build out the platform and the capabilities and work with new partners, what are some of the things you have planned for the near to medium term future of the product?
[00:48:43] Unknown:
So the big thing we are definitely seeing with our customers is more and more applications of that global file index. A lot of our customers are on a journey towards, you know, doing more machine learning or doing more automation overall, doing better analytics on data. I feel like, you know, a lot of the innovation on data analytics has so far been more on the data warehousing side. To be honest, I think most data lakes are kind of data swamps. There's too much junk in those data lakes. It's too hard to really figure out what you have. And now that a lot of people have implemented a data warehouse, they've gotten the structured analytics, you know, behind them. For the most part, they're starting to look at unstructured data, and they're starting to think about, well, how do we make a data lake more productive? How do we use these analytics on our unstructured data? And especially, how do we feed machine learning? And so that is a really exciting area of growth, and we're investing a lot in that. We're investing a lot in creating blueprints, example use cases, you know, maybe even Jupyter Notebooks, so it's easy to figure out how to do the execution on things. You know, we have APIs, and we're providing training around how to use them. So we're doing a lot around that, you know, enabling unstructured data analytics.
[00:50:07] Unknown:
To your point of working with data lakes, one of the other things I'm curious about is the types of analysis and aggregate information that you surface to end users to help them understand what they even have.
[00:50:22] Unknown:
It's funny. You know, it may sound almost elementary, but, you know, you just think about an average user who may have, like, a few hundred buckets of data. And even just a simple thing like listing all the files and finding something: if they had to manually go and list every single bucket, you know, one at a time, try to find things, you know, collect them, then try to figure out a way to manually move these different objects that might be in many different buckets into a new location. Every one of those tasks is tedious and time consuming and laborious and error prone. So at the first level, the first thing we're doing for these customers is we're giving them a single place where they can just go and type in something and get all their search results in one place, you know, a single place where they can set up policy. And based on those search results, Komprise copies it or moves it or does whatever they want with it. And then giving them analytics like, hey, how much data do you actually have? You know, how is it actually being used? You know, what kinds of data is it? You know, is your data lake full of video files, or are there a lot of files of another type? You know, who are the users that are using it the most? We're giving them all that kind of visibility, in addition to visibility on a per-object or per-file basis. I mean, it's like if somebody asked, what does Google search do for the web? I mean, how would you answer that? I feel that's what we're doing for all these unstructured data lakes. Are there any other aspects of the work that you're doing at Komprise or the overall space of managing and organizing
[00:51:59] Unknown:
and categorizing these unstructured data objects that we didn't discuss yet that you'd like to cover before we close out the show? You know, I would say that, you know, the biggest area where I think you might see more action
[00:52:12] Unknown:
is kind of this cross-platform way to not just search and mobilize data, but also to execute. Because, you know, if you have a Lambda function, there's no reason why it just can't be run on the data right there, you know, through an interface like ours. So the biggest thing I see is continued growth in data management as a way to search, mobilize, execute, and enrich data.
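That execute step can be pictured with the standard AWS SDK: given a set of keys that came back from a metadata search, fan a Lambda function out over them. The function name, bucket, and keys here are hypothetical stand-ins for real search results:

```python
import json

import boto3

lam = boto3.client("lambda")

search_hits = ["scans/0001.dcm", "scans/0002.dcm"]  # stand-in for index results

for key in search_hits:
    lam.invoke(
        FunctionName="classify-image",  # hypothetical function
        InvocationType="Event",         # async: fire and forget per object
        Payload=json.dumps({"bucket": "data-bucket", "key": key}),
    )
```

The interesting part is not the invocation itself but that the list of keys comes from one cross-platform index rather than from crawling each store by hand.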
[00:52:40] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:52:57] Unknown:
I think the biggest kind of mindset problem that the industry has, and I don't think it's coming from customers so much, is that the industry is mostly host-based. What I mean is, you know, everybody who's selling some infrastructure is trying to add some management in that infrastructure. But because it is based on that infrastructure, it's siloed. So I think the biggest change that is happening is looking at data management outside of infrastructure, looking at data management as a data problem, not an infrastructure problem.
[00:53:34] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing at Komprise. It's definitely a very interesting product working in an interesting problem space, and one that I'm excited to see grow and scale. So I appreciate all of the time and energy that you and your team have put into that, and I hope you enjoy the rest of your day. Thank you, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Interview with Krishna Subramanian: Introduction and Background
Understanding Unstructured Data
Komprise: Addressing Data Management Challenges
Target Users and Use Cases
Shortcomings of Storage-Centric Data Management
Complexities of Cloud Migration
Collaboration Between IT and Users
Cost Management and Data Duplication
Strategic Positioning and Partnerships
Technical Architecture of Komprise
Data Discovery and Heuristics
Use Cases and Workflows
Customer Success Stories
Lessons Learned and Company Culture
When Komprise is Not the Right Choice
Future Plans and Innovations
Data Lakes and Analytics
Future of Data Management
Conclusion and Contact Information