Summary
There is a wealth of options for managing structured and textual data, but unstructured binary data assets are not as well supported across the ecosystem. As organizations start to adopt cloud technologies, they need a way to manage the distribution, discovery, and collaboration of data across their operating environments. To help solve this complicated challenge, Krishna Subramanian and her co-founders at Komprise built a system that allows you to use and secure your data wherever it lives, and to track copies across environments without requiring manual intervention. In this episode she explains the difficulties that everyone faces as they scale beyond a single operating environment, and how the Komprise platform reduces the burden of managing large and heterogeneous collections of unstructured files.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Your host is Tobias Macey and today I’m interviewing Krishna Subramanian about her work at Komprise to generate value from unstructured file and object data across storage formats and locations
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Komprise is and the story behind it?
- Who are the target customers of the Komprise platform?
- What are the core use cases that you are focused on supporting?
- How would you characterize the common approaches to managing file storage solutions for hybrid cloud environments?
- What are some of the shortcomings of the enterprise storage providers’ methods for managing storage tiers when trying to use that data for analytical workloads?
- Given the growth in popularity and capabilities of cloud solutions, how have you approached the strategic positioning of your product to capitalize on the market?
- Can you describe how the Komprise platform is architected?
- What are some of the most complex considerations that you have had to engineer for when dealing with enterprise data distribution in hybrid cloud environments?
- What are the data replication and consistency guarantees that you are able to offer while spanning across on-premise and cloud systems/block and object storage? (e.g. eventual consistency vs. read-after-write, low latency replication on data changes vs. scheduled syncing, etc.)
- How do you determine and validate the heuristics that you use for understanding how/when to distribute files across storage systems?
- How does the specific workload that you are powering influence the specific operations/capabilities that your customers take advantage of?
- What are the most interesting, innovative, or unexpected ways that you have seen Komprise used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Komprise?
- When is Komprise the wrong choice?
- What do you have planned for the future of Komprise?
Contact Info
- @cloudKrishna on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Today's episode is sponsored by Prophecy.io, the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all the data users can use software engineering best practices: Git, tests, and continuous deployment with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control. Then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage.
Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. Your host is Tobias Macey, and today I'm interviewing Krishna Subramanian about her work at Komprise to generate value from unstructured file and object data across storage formats and locations. So, Krishna, can you start by introducing yourself?
[00:02:04] Unknown:
Yeah. Thanks, Tobias. Yeah. I'm a cofounder and COO of Komprise. We are a data management company headquartered in Silicon Valley,
[00:02:13] Unknown:
and I'm one of three cofounders of the company. And do you remember how you first got involved in working with data?
[00:02:19] Unknown:
Yeah. We've been working with data for nearly 30 years now. My two cofounders and I have backgrounds in distributed computing. A lot of the problems of how to manage data can be solved with good distributed scale-out architectures. So we did two companies prior to Komprise, both of which were acquired. And then we started Komprise based on feedback from our prior customers.
[00:02:44] Unknown:
So in terms of what you're building at Komprise, can you give a bit of an overview about the problems that you're trying to tackle and some of the story behind how the company came to be and why you decided that this was the problem space that you wanted to spend your time and energy on?
[00:02:58] Unknown:
Yeah. You know, it's very interesting. We all intuitively know that data growth is exploding. You can see it in your own life. I mean, we're recording this video now. There's video and audio that we are generating. You're probably generating a lot of audio with all your podcasts. You're taking a lot of pictures on your phone. You're probably going to the doctor, where they're doing an X-ray or an MRI. Or you get in a car and you're driving it. If it's an electric car, you're generating IoT data. So you can see in your own life how much data each of us is generating. So probably everybody knows we're generating a lot more data today than ever before.
What we may not realize is that 90% of the data in the world today is what we would call unstructured data, meaning that it's not data in a database. It's not data organized neatly in rows and columns. It's things like audio files, video files, genomics files, imaging files, you know, all these kinds of data. And this is not how it used to be. You know, before 2010 or so, most of the data in the world was actually structured. A big shift happened around the early 2010s. And from then on, if you look at the growth curves that IDC and others have, all the explosion has been in unstructured data.
And we started Komprise because many of our customers from our prior two companies came to us and said they were caught by surprise with this tremendous growth of unstructured data, because you cannot manage it the same way that you manage a database. And so vendors were not addressing that gap of how to manage unstructured data. And we realized that managing unstructured data is not a storage problem. You actually need a layer above storage, and you need a distributed layer that works across different clouds and across different storage. That fits well with our background. That's why we started Komprise.
[00:05:03] Unknown:
And to your point, there are a few different axes around the management of this unstructured data, where one layer is the actual storage location of it, where it might be block storage on spinning disks, or it might be object storage in something like S3. Then there's also just the metadata about what are the file names, what are the attributes of the files, where are they located, and things like that. And then there's also potentially more rich data that you can extract from those unstructured files to understand, okay, this is a PDF, and it has some graphics that I might be able to incorporate into some other product. So there are many different aspects of that. And I'm curious which aspect or aspects you're focused on addressing in the work that you're doing at Komprise?
[00:05:50] Unknown:
Yeah. As you said, definitely, you know, storage of data is important because you wanna optimize it for the kind of data that you're holding, and you can address price and performance. And there's been a lot of innovation in storage, you know, from the Cloud vendors and the different storage vendors around unstructured data, you know, particularly file storage and object storage. The piece we are addressing is the layer above that: how do you know what storage is right for this data at the right time? How do you know what data to pull out of a data lake and do analytics on? You know, how do you make it easy for users to find data no matter where it lives? Because data is scattered around so many silos. Right? So that is what we call data management.
So data management is providing a consistent, systematic way to search across all your data, to understand all your data, to right place data, and to execute functions on data. And those are things that go beyond any single storage silo because data is scattered in so many places. That's why a separate layer that can just give you a view of the data regardless of where it lives and mobilize that data, you know, is required. That's what we call data management. It's analytics and data movement and data extraction.
[00:07:16] Unknown:
In terms of that sort of discoverability and organization capacity, there's another company that I spoke with recently called Unstruk that's focused on building a sort of metadata lake for your unstructured data sources. And I'm curious if you're familiar with that company, and if so, how you would characterize
[00:07:35] Unknown:
the work that you're doing versus the way that they're approaching the problem. I am not familiar with those guys in particular, but I will tell you that there are data lakes. Like, you know, Azure itself has a data lake. Amazon has a data lake. I think the problem that a lot of companies run into is that, you know, if the data is not optimized in some way, and unstructured data doesn't have any specific format to it, it has all kinds of different formats, then how do you know what to find? How do you know even where to look? So a lot of what Komprise does is you just point Komprise at any of your cloud accounts and any of your data centers.
Komprise finds all the data in all these places. It indexes all the data for you. It gives you analytics on all the data. And we didn't have to move it into our own metadata lake. We didn't do anything like that. Your data is wherever it is, but it's giving you a view of everything. And then you can search. And you could say, oh, I see that this data hasn't been touched in over a year, and it's consuming Flash storage; that's a mismatch. Let me put that in the Cloud. Or I know that for my legal hold, I need to keep all the data from this user related to this project in an object-locked bucket for 5 years. Let me have Komprise move it there and lock it for 5 years. After 5 years, I can delete it. So we enable discoverability of data wherever it lives. And then systematically, by policy, we move the data.
And then we also enrich the data with tags. So basically, we're providing kind of an index, if you will, an actionable index across different repositories.
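To make the policy-driven approach concrete, here is a minimal sketch of what such a tiering rule could look like, assuming a hypothetical policy object evaluated against a metadata index; Komprise's actual API is not described in this conversation, so every name below is illustrative.

```python
# Hypothetical sketch of a policy-driven tiering rule like the one described:
# "untouched for over a year and sitting on flash, so move it to the cloud".
from dataclasses import dataclass, field

@dataclass
class TieringPolicy:
    name: str
    min_days_untouched: int   # how cold the data must be before it moves
    source_tier: str          # where it lives today, e.g. "flash-nas"
    target_tier: str          # where it should go, e.g. "s3://archive-bucket"
    tags: dict = field(default_factory=dict)  # enrichment applied on move

    def matches(self, file_meta: dict) -> bool:
        # A scanner would run this check against every record in the index
        return (file_meta["tier"] == self.source_tier
                and file_meta["days_untouched"] >= self.min_days_untouched)

cold_to_cloud = TieringPolicy(
    name="cold-flash-to-object",
    min_days_untouched=365,
    source_tier="flash-nas",
    target_tier="s3://archive-bucket",
    tags={"lifecycle": "archived"},
)

print(cold_to_cloud.matches({"tier": "flash-nas", "days_untouched": 400}))  # True
```

Files that match would then be queued for a transparent move, with the policy's tags applied as part of the transfer.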
[00:09:22] Unknown:
Yeah. The note of being able to manage the lifecycle of the data and being able to apply a policy to say this source of information has this compliance regime that we need to follow. So we need to make sure that we maintain it for X number of years before we can actually delete it, and we don't have to worry about, you know, lack of institutional memory or fat-fingering a particular command and accidentally deleting data that we have to maintain for legal purposes.
[00:09:49] Unknown:
Yes. Exactly. Yeah. We believe in something called actionable analytics. You know, there's probably different ways you might be able to get analytics on your data or information about your data. But, you know, in the space of unstructured data, a petabyte of data is probably a couple of billion files. And if you have tens to hundreds of petabytes of data, you're dealing with, like, hundreds of billions of files. There's absolutely no way anybody giving you just analytics will help you much, because for you to take some action on it, it's cumbersome. That's why Komprise does analytics, indexing, and mobility in a single solution.
To your point, you could just set a policy and say, hey, this data is for legal hold, I'm putting it here for 5 years. And you just set the policy once. Komprise will do it all for you. You don't have to worry that people didn't remember or something happened. Komprise will put it in that bucket. 5 years later, it will pull it out of there, put it into your trash bucket, and ask you, hey, are you ready to delete it? You know, so all of that will automatically happen.
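For the object-lock piece of that legal-hold workflow, the underlying cloud mechanism is real and well documented: S3 Object Lock in compliance mode prevents deletion or overwrite until a retention date passes. A minimal sketch with boto3, using hypothetical bucket and key names (and assuming the bucket was created with Object Lock enabled):

```python
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")

# Hold the object for roughly five years, as in the example above
retain_until = datetime.now(timezone.utc) + timedelta(days=5 * 365)

s3.put_object(
    Bucket="legal-hold-bucket",         # hypothetical bucket with Object Lock enabled
    Key="project-x/study-results.pdf",  # hypothetical key
    Body=b"(contents of the held file)",
    ObjectLockMode="COMPLIANCE",        # cannot be deleted or overwritten by anyone
    ObjectLockRetainUntilDate=retain_until,
)
```

Once the retention date passes, an ordinary delete succeeds, which is the point where a policy engine would move the object to a trash bucket and ask for confirmation.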
[00:10:57] Unknown:
In terms of the target users and use cases that you're focused on, I'm curious how you think about the personas of either the individual end users and roles that they fulfill or the sort of categories of organization that would benefit most from the product that you're building?
[00:11:16] Unknown:
Yeah. It's a great question. So we address two sets of use cases. One use case is more on the infrastructure side. So typically, we go in to somebody who owns the infrastructure budget for data. And so it might be like a VP of IT infrastructure, VP of cloud infrastructure, or VP of global storage. You know, depending on the company, those role titles might be different. But, basically, they're looking at the cost of their storage and data protection and data management. They're looking at cloud transformation. They need a way to cut 60 to 70 percent of cost while handling data growth. And then they need a way to make themselves more agile and deliver data as a service.
And so that's who we go into and they're usually our champions in an organization. And what they do is they see the value of this index and they bring in their departmental IT. They bring in their legal IT. You know, they bring in these other teams that are able to use the data for legal hold or for, you know, feeding big data and machine learning analytics or for doing things like, you know, deleting obsolete data. That's another problem a lot of these companies have. Obsolete data can be a liability in a company. So identifying it proactively is important. So the short answer is infrastructure IT or cloud IT is usually the starting point for us.
[00:12:45] Unknown:
As far as the ways that these sort of enterprise IT and, you know, data management teams within these organizations are handling the problem of file storage, file indexing, you know, organization, categorization, some of these access controls and life cycle policies. What are some of the general approaches that they have typically been using to tackle this range of problems related to this storage of, you know, unstructured data and file objects and data assets?
[00:13:20] Unknown:
Yeah. That's a great question, Tobias, because, you know, when unstructured data's footprint was pretty small, you know, basically, you could just rely on your storage vendor to handle it for you. If you have very little data and you just have one storage vendor, one architecture in one place, you kind of work out a good price with them, and the problem is just not big enough to go and worry about anything else. And that has significantly changed today because data outlives storage. In most enterprises, data has a minimum lifespan of 25 to 30 years, sometimes much longer.
And most of your storage purchases or even Cloud contracts, you're looking at a 3 to 5 year time frame. So your data is going to go through several iterations of storage, several iterations of backup. Technology is gonna change significantly in 30 years. So do you really wanna be locked into all these silos of data management? And more importantly, do you want to have low visibility into your data across all of this? Because increasingly, it's not just about storing data. It's about maximizing the use of that data over its 30-year lifespan. The requirements have changed significantly.
The problem has gotten more complex. There are way more options available to customers. And they are being asked to deliver a data service, not a storage service. So that's why, you know, data management needs to be independent from storage.
[00:14:56] Unknown:
As to the shortcomings that exist when you are relying on the storage layer to be the kind of gatekeeper for these use cases and these requirements, what are some of the issues that come up in terms of, you know, wasted effort, or issues with failing to comply with your regulatory requirements, or just issues with losing data because you don't know where it is or maybe it gets deleted accidentally? Just some of the issues that arise because of the fact that you don't have this more holistic approach to the data storage and management.
[00:15:32] Unknown:
There's several things that you kind of highlighted there. The biggest, I think, is that there's a lot of missed savings opportunities, and there's a lack of flexibility. You know, just imagine, like, you're recording from the studio, and you might go to the kitchen to get a meal at some point, and you're free to move around your home. Right? And you're free to go anywhere, and you might leave the house at some point. And that's because you're autonomous and you're not controlled by the house that you're living in. For data, it's not like that. Right? Data is put in storage, and the storage is trying to move you around. And, you know, it's trying to tell you that you can't leave. If you leave, you know, there's a cost. You know, all the data has to be rehydrated.
And so that's not ideal. So from a customer perspective, customers lose about 60 to 70% of the savings that they could get if they had a data management solution versus a storage-centric one, because there are way more options available to them. So as a simple example, if you put data in a cloud file storage, the file storage may only have a limited number of tiers, but the Cloud may have 30 or 40 tiers available to you. And some of those tiers are, you know, orders of magnitude less expensive than the tiers in the file storage. So data management will move your data to those cheaper tiers, but the file storage solutions won't. They'll only keep you inside the file storage tiers, because they don't make money if they let you move out of their environment.
So there's a lock-in: basically, you get locked into higher costs and you get locked out of the native capabilities of other platforms.
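The cloud-native tiers she contrasts with file-storage tiers can be seen directly in AWS's lifecycle rules. As a point of reference, this is what a native transition between S3 storage classes looks like with boto3; the bucket name and prefix are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Move objects to progressively cheaper S3 storage classes as they age
s3.put_bucket_lifecycle_configuration(
    Bucket="archive-bucket",  # hypothetical
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-cold-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "projects/"},
            "Transitions": [
                {"Days": 365, "StorageClass": "GLACIER_IR"},     # ~1 year cold
                {"Days": 1095, "StorageClass": "DEEP_ARCHIVE"},  # ~3 years cold
            ],
        }]
    },
)
```

A storage-centric file system can only shuffle data among its own tiers; a rule like this operates on the cloud's full range of storage classes, which is the gap a data management layer exploits.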
[00:17:15] Unknown:
And as to this cloud migration aspect, as you mentioned, a lot of these enterprise organizations and traditional IT are going to be dealing with storage vendors, or maybe they've got, you know, a large SAN array or a NAS that they're dealing with, or they're dealing with a flash array storage system. And as they're starting to move to the cloud because different business units have different operational requirements, they wanna be able to take advantage of the elasticity, or they need to be able to iterate quickly, and they don't wanna have to deal with some of the acquisition times required to, you know, rack and stack new servers. What are some of the complexities that come about when they are trying to span across these different operating environments, where they have these on-premise data centers and the in-house knowledge and capacity to manage some of those storage solutions, but they also need to be able to manage a unified interface or unified access layer and governance of this data as it moves into these, you know, private cloud or public cloud environments? How does that manifest as far as the scope of responsibility, as to who owns that problem and how they might approach the sort of collaboration and enforcement of those different requirements?
[00:18:33] Unknown:
It's a very interesting point because, you know, typically, IT is a custodian of data. You know, IT is asked to store data, protect it, move it to the cloud, and do all those things, but they are not the users of the data. So department users, line-of-business users are the users of the data. They are the ones that know what should actually happen with the data, what data is actually important to them, what data, you know, they want to use where. Right? And today, you know, it is very imperfect, because when you manage data through silos and you lack visibility into data, all decisions are ad hoc. You know, somebody comes and tells you, hey, I need the best storage for my data because I'm gonna run this massive job. You provision, like, expensive Flash storage in the cloud or whatever for them. And then, you know, they're done with that analysis in a month, but they never tell you, because, you know, that's not their job. They moved on to something else, and that data is still consuming all those expensive resources. You know, so for IT, the challenge, I think, is how do they provide an environment that's flexible enough that different departments and different users can have different policies for their data based on their needs?
And yet, how can IT have central visibility across Cloud and data center into what they're doing and how data is being managed, so they can enforce things at a central level and govern at a central level but not get in the way of users using the data? Because ultimately, it's all about improving the productivity of the users. And again, that's where data management is important, because a system like ours, for example, we basically move data without getting in the way. We never get in the path of the data. Users think the data is still there. They can use it from wherever.
All their applications continue to work, whether it's in the cloud or not, because we manage kind of providing that transparency. We have a patented way of doing that across file and object storage. And so we give IT the visibility into, hey, where is data? Who's using it? How much is it costing? Where do you want to put it? All those things IT can manage, but they are not getting in the way of users. Users have full access to their data, they can set, you know, rules on what they want, and they can search for data. That all is completely available to them. So it's a way for IT to collaborate with users rather than have this friction between IT needs and user needs, which are not always, you know, the same priorities.
[00:21:19] Unknown:
StreamSets' DataOps platform is the world's first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor, and manage data pipelines confidently with an end-to-end data integration platform that's built for constant change. Amp up your productivity with an easy-to-navigate interface and hundreds of prebuilt connectors, and get new pipelines to production and new hires up to speed quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift, those ongoing and unexpected changes in schema, semantics, and infrastructure.
Finally, one single pane of glass for operating and monitoring all of your data pipelines gives you the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners that subscribe to the StreamSets professional tier will receive 2 months free after their first month.

To the point of cost management and visibility, one of the other challenges that always comes up when you're trying to manage organization and cleanup of data is the issue of duplication, or understanding, you know, who created what when.
And particularly from the duplication and replication standpoint, I'm curious how you have approached that challenge of being able to say, okay, I have this file. It lives in this flash array on premise in this data center, but now I need to be able to access it from, you know, this cloud service to be able to do some, you know, machine learning algorithm on it. And so I need to be able to copy it over to this S3 storage location to be able to merge it with these other data assets. But then when I'm done with it, I wanna make sure it gets deleted. And just some of those complexities that come up when you do have to move data around multiple places, and managing some of the latency requirements around it, the cleanup afterwards to make sure that you only have one sort of canonical source of truth for a data asset. And in the event that that data asset changes and it's still being used by some other system, making sure that that information gets replicated to those different places.
[00:23:30] Unknown:
Yeah. No. That's great. It's a very important use case, actually. Exactly. So we allow you to, like, search and find data and have Komprise, you know, maybe copy that data to a location. We have something called deep analytics actions, and then manage the ongoing life cycle of that. Because to your point, otherwise there's that other copy, and it's just sitting there consuming resources forever, because nobody thinks about it after they finish their analysis. I mean, that's one thing: we as humans, we're naturally hoarders. Right? I'm a hoarder. I never delete anything. And, you know, that's our nature. And it's not our job all the time to ask users to go running around cleaning things up. That's a waste of their time.
That's where systematic processes should take over and should be able to handle it for you. So Komprise does that. It not only will make that copy of the data, but based on the policies you set, it will also do the cleanup of that. And if you did do some analytics in that other environment, we have ways to actually enrich the original data with tags. So all the output of that analytics doesn't get lost in the copy. It can be put back on the original source of truth. So not only do we maintain the consistency of the data, but also we're continuously enriching data as it goes through different processing. Yeah. And, you know, the other use case you didn't mention but is also very important is that most organizations by design keep multiple copies of data, whether for ransomware reasons or disaster recovery reasons or backup. You know, if suddenly a virus hits, you would go to an older version. But not all data needs 5 to 10 copies. Because if it's never been touched in a year or more, there's a cheaper way of getting that protection.
And so Komprise actually rightsizes the backup for you. So you can cut 80% of the backup footprint and cost by just doing a more passive way of leveraging durable solutions like object storage and object locking on cold data, and then using the high-end backup for hot data. So Komprise does all those things so that you have affordable backup and affordable ransomware protection. Particularly with ransomware, the costs are skyrocketing for companies.
[00:25:56] Unknown:
Yeah. Those are definitely useful and interesting problems. And most of the time when people are talking about big data and massive analytics, they're usually ignoring the question of backup, because, you know, if you have to have two copies of petabytes or exabytes worth of data, then that's a massive problem to solve. And so a lot of people just don't bother solving it. And so it's definitely interesting to think about some of the different ways that you can address that problem, where maybe you don't have to have two copies of the data everywhere. You just need to make sure that that one copy that you have is sufficiently protected so that it's never going to get deleted or modified.
[00:26:31] Unknown:
Exactly. Or you use, like, geo-dispersion within the storage. You know, you use a second zone within it. There are lower cost ways of getting protection. It's not the right answer for all data, but for passive data that hasn't changed in a long time, you know, you can right-size that protection.
[00:26:52] Unknown:
Before we start to dig into the technical aspects of what you're building, another sort of business-level consideration that I'm interested in is the way that you think about the strategic positioning of the Komprise product in relation to all the different players in the ecosystem, where you have the different storage vendors, you've got the cloud providers. And looking at your website, I noticed, you know, we have partnerships with, I think it was, Amazon, Microsoft, Google, you know, a whole bunch of different vendors. And so I'm curious how you think about that positioning of being this utility layer that is agnostic to the storage underpinnings and being able to play nicely across the whole ecosystem?
[00:27:33] Unknown:
You know, at a technical level, we are believers in being standards-based. Everything we do is standards-based, and the interfaces are all nonproprietary, meaning that we work with standard file protocols and object protocols. We put data in native form in all the storage. The reason, you know, all the cloud providers like to work with us is because when we move data to their Cloud, we put it in the native Cloud format. We don't lock it into a Komprise format. So you can directly go and use your data in the Cloud. You can directly run Redshift in Amazon. You can directly run Azure Analytics in your Azure bucket. You can do all of that without going through Komprise or without going through your file storage.
We believe that data should be in the control of the customer. It's their data, and they should be able to maximize the use of the data wherever it goes. You know, how do we partner with these different vendors? You know, we work very closely with the major Cloud and file storage providers and we actually expose a lot of their key capabilities in a simple way for customers. So as an example, you know, Amazon has currently almost 16 different tiers and classes of file and object storage. And we are the only ones who can leverage all of them. So we can put a file into Amazon FSx and when it's no longer being actively used, we can tier it transparently to Glacier Instant Retrieval, which is about 40 times cheaper than the Flash layer of FSx.
And then because we've indexed that data and we keep it in native S3 format, we can promote that data back up into an EC2 instance and run, you know, Redshift on it or run Snowflake on it. So we can move the data up and down in different directions consistently without lock-in. And so for the Cloud providers, we provide a simple way for the users to use the richness of the services they provide, because it's actually overwhelming how many services you can get from just one cloud. And if you manually write things to use each of those services, with the providers innovating at a breakneck speed, it's impossible for customers to keep up sometimes.
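The demote-and-promote movement she describes maps onto operations that exist natively in S3, since Glacier Instant Retrieval objects remain directly readable. A rough sketch of both directions with boto3, using hypothetical names; Komprise's transparent-move machinery is of course more involved than an in-place copy:

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "data-bucket", "datasets/scan-0001.dcm"  # hypothetical

# Demote: rewrite the object into Glacier Instant Retrieval
# (still readable in real time, unlike the archival Glacier tiers)
s3.copy_object(
    Bucket=bucket, Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    StorageClass="GLACIER_IR",
)

# Promote: bring it back to the standard tier before a compute-heavy job
# (note: objects over 5 GB would need a multipart copy instead)
s3.copy_object(
    Bucket=bucket, Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    StorageClass="STANDARD",
)
```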
[00:30:01] Unknown:
Digging into the technical layers, I'm wondering if you can talk through some of the ways that the Comprise platform is actually architected and the technical components that you're using to be able to provide the, you know, discovery, search, transportation, sort of all of these different aspects of being able to integrate at the storage layer and provide the interface for end users to be able to manage their data assets across these different locations and formats?
[00:30:28] Unknown:
So there are really 3 key things that I would say are the core innovations of Komprise, and these are patented. So the first is the ability to have this kind of distributed architecture. And when we say distributed, what we mean is Komprise works across different data centers, different sites, different clouds, different accounts, different buckets, different storage architectures. And it can work across all of these as a lightweight Cloud service. So there's really very little for a customer to set up to use Komprise. I mean, you just sign up and you can start using it. You just point it at your account and it starts working. It is not easy to make that happen from a technical perspective. So that lightweight distributed architecture is one of the core innovations.
The second thing is something that we call transparent move technology. And what we mean by that, I'll give you kind of a simple analogy. Let's say you're a shopkeeper and you have many things in your shop. I can come and ask you; I can say, Tobias, I want chewing gum, or, Tobias, I want candy. And whatever I ask you for, you will get it from wherever it is. Some things might be in your warehouse, some might be on the shelf. You will bring it to me. That's one way of providing data mobility. And it's a very intrusive way, because every time I need something, I need to come to you. I need to ask you. And you're the only one who can give me something.
The other approach is I allow self-serve. I have everything, and you could be in any store, and you can just search for it and get it and use it yourself, and I don't have to be in the way. You can use it directly. Right? Doing that, moving data so that I'm not in the path of the data access, is really difficult to do. It's very, very difficult. That's what Komprise does. Komprise provides this transparent move technology, where we can take a file, we can move it into an object, and it still looks like a file from the original place. It can still be used as a file from the original place, but it can be directly used as an object and can be directly manipulated without going through us. And so we are not a broker in the middle. We didn't create a new namespace you have to come to. We didn't create any bottlenecks for you. That is really hard to do, but it's extremely important to scale.
So that's the second thing. And the third thing we do is something we call the global file index. Basically, all these billions of files and objects that you have scattered everywhere, we have a central place to search. It's like a Google search for all your data, and it's very lightweight. You didn't have to do anything, and you get that search. Making that index lightweight, efficient, and actionable is technically not easy, and Komprise has that. So those are our 3 core innovations.
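As a way to picture the third innovation, here is a toy version of a global file index: a single searchable catalog of metadata records harvested from many stores. This is purely illustrative; the real index is patented and built for billions of records, not a Python list:

```python
from dataclasses import dataclass, field

@dataclass
class FileRecord:
    path: str            # file path or object key
    store: str           # e.g. "nas-01" or "s3://archive-bucket"
    size: int            # bytes
    days_untouched: int  # derived from last-access time
    tags: dict = field(default_factory=dict)

index: list[FileRecord] = []  # in reality, a distributed, persistent index

def search(**criteria):
    """Return every record whose attributes or tags match all criteria."""
    hits = []
    for rec in index:
        attrs = {"path": rec.path, "store": rec.store,
                 "size": rec.size, "days_untouched": rec.days_untouched,
                 **rec.tags}
        if all(attrs.get(k) == v for k, v in criteria.items()):
            hits.append(rec)
    return hits

index.append(FileRecord("genomics/run42.fastq", "nas-01",
                        7_500_000_000, 400, {"project": "sars-study"}))
print(search(project="sars-study"))  # one query surface spans every store
```

The point of the real system is that this one query surface spans every data center, cloud, and bucket, and that the results are actionable, feeding moves and tag updates rather than just reports.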
[00:33:23] Unknown:
Yeah. There are definitely a lot of interesting sort of detailed questions that I'd be happy to dig into. But at a more macro level, I'm curious what you have found to be some of the most complex challenges and considerations that you've had to engineer for and work around when you're dealing with the scale and variety of data and locations that you have to interface with?
[00:33:47] Unknown:
So the good news, I think, is that some of these standards are getting widely adopted, because if that weren't the case, this problem would be a lot harder to solve. In fact, that's why it hadn't been solved before, because there were no standards, you know, until about 7, 8, 10 years ago. You know, that's when standard SMB 2.0 came out and NFS was accepted. And, you know, S3 has become a de facto standard for object. All of these things happened in the last decade or so. So what is difficult about what we do? I think, you know, what is most challenging is that the system has to be really simple for someone to use, but it has to be performant in a non-intrusive way. And what I mean by that is we never want anybody to notice that we're even there. We don't want your storage to be any slower because Komprise is doing some analytics or moving some data around. Right? So, you know, if a network goes down, Komprise doesn't throw up an error. It knows to retry.
If some system is unavailable, Komprise knows; it's a fault-tolerant architecture. To create a distributed, fault-tolerant architecture that's non-intrusive is not easy to do. And that's where our background comes in. I mean, that's our background: distributed, scaled-out, fault-tolerant computing.
[00:35:11] Unknown:
In terms of some of the specific details that come up, one of the things that comes to mind is you have this file object. You know, maybe it's an Excel document or a Word document, and it lives in your corporate SAN. And so you then decide that you want to pull that into a Spark job to be able to pull out the data that's in this Excel file, process it, and then maybe append a new row based on the output of that. And so then you'd want that to be reflected back to the original source location, and just being able to identify when there are changes to files and when they need to be written back to the sort of canonical reference location, and just some of the signaling that goes into being able to register and process those event hooks across the life cycle of these data objects?
[00:36:03] Unknown:
So, yeah, that's exactly correct, Tobias. When we move files, even if you change, like, a permission on a file at one end, you know, we maintain all the access controls. So the same permission has to be reflected, you know, where the file was moved to, for example. And then, you know, tags: no matter where you go, you know, like, Amazon has a slightly different way of doing tags than Azure. Some systems have no tagging at all, like file storage. But how do we keep all that enriched metadata consistent no matter where the data goes? These are all the problems that we solve at the data management layer.
[00:36:38] Unknown:
In terms of the guarantees that you provide when somebody does say, I want to copy this file from the SAN to S3 and then from S3 to, you know, cold storage in AWS Glacier, or I wanna replicate it from Amazon S3 to Google object storage. I'm wondering what types of latencies people are expecting when they make these operations and some of the ways that you're able to manage the user experience when you do say, okay, I need to replicate this data, particularly if it's maybe a sizable file. Then also some of the questions about eventual consistency across these different storage locations versus read-after-write consistency, and just the scheduled syncing, just some of the management of user expectations and user experience as they perform all these, you know, potentially expensive and slow operations?
[00:37:32] Unknown:
So the first thing is that a lot of the operations we perform are kind of in the background. So for example, if you take a share or volume, if you take, like, an entire volume of files and you say, I'm gonna copy this volume into Amazon, I'm going to tier it into Glacier IR, you set the policy in Komprise. Komprise gives you updates on what it's doing, but you're still using the data. You're using it fully until you're ready for a cutover. And that few seconds to do the cutover is the only time when you actually see a shift. You don't really see it; the DNS shifts, and then you use the data from the next place. So from the user perspective, most of what we do is somewhat invisible.
They don't actually see any of these things. For the IT administrator who's scheduling these things, we have an elastic architecture. So basically, sometimes you only have so much network availability. You don't want to use all of it up for this job, and it's okay for it to run in the background and take longer. Sometimes you wanna do it much faster. So Komprise has elastic parallelism. So you can actually toggle that up or down and say, hey, I want you to actually distribute this job more because I want you to run it faster. Or, I actually need to go slower here; you know, run this in the evenings or weekends when nobody is using the network. So Komprise actually can do all that automatically. It kind of adaptively throttles
[00:38:59] Unknown:
based on the policies you set. And another thing that you mentioned is that you have the ability to help end users understand which assets are worth inspecting when they maybe want to do some analytical task, and being able to say, you know, these are the criteria that I want to be able to analyze, so then these are the specific files that might contain that information. And I'm just curious if you can talk to some of the concrete information that you're able to point to and some of the types of heuristics that you need to fall back on, and ways that you validate those heuristics and evolve them as people increase their usage of Komprise and expand upon their sophistication?
[00:39:43] Unknown:
At first blush, you know, we index everything automatically based on all the available metadata, because that's already there, you know, in different file systems and objects. Right? So for example, let's say you wanna find all the genomics files that you have across all your data centers that belong to a particular project. Maybe it was a SARS project, and it was created, you know, in this time range, regardless of which user created it. Set that query in Komprise, and Komprise will give you all the files. Some of them might come from your data center in Asia-Pac. Some of it might be in a cloud somewhere. Some of it might be from a data center here, and some of it might be from different departments. But it would show you that entire set of data, which itself is not easy if you had to manually go and try to find that information. Right? But now maybe you want to take that list and you want to look for anything that had studies done in the Netherlands.
And so now you can feed it into an indexer which actually looks for Netherlands inside those, you know, files and then tags all of them. And so now you can run the search again and say, hey, now let me see all the ones tagged as, you know, Netherlands studies for SARS. And now, okay, I want to take that dataset and I want Komprise to copy it maybe into my Databricks, because, you know, in Delta Lake I have a job that's gonna actually run, you know, Apache Spark or something. And it's gonna run some microscopic analysis on it, and it's gonna give me outputs of, you know, which of those studies actually, you know, showed a mutation or whatever. So then you can enrich the data again. So Komprise kind of gives you a systematic way to search and discover, mobilize the data, enrich data, bring it back, and do this cycle over and over again. So we're not always the ones doing the in-depth analysis of the data, because you have so many different indexers out there, so many different AI engines, cognitive engines. But we provide a consistent way
[00:41:51] Unknown:
to execute all of those. Does that make sense? Yes. And to the point where you are expanding on your example of collecting all of the genomics information related to these studies having to do with SARS-CoV-2 over this time range: we've determined that, you know, based on the analysis, the genomics information contained in these sets of files pertains to the Delta variant, these ones pertain to Omicron. And so now I want to be able to write back some of this metadata to live alongside those files. And to your point of the fact that you work with all of these different standards and open protocols, I'm curious, what are some of the limitations that folks run into when they maybe want that data to live natively inside the file object, and when they need to fall back to having that as an additional piece of metadata that lives inside Komprise and just needs to be shuttled around with that file object within the Komprise layer, and just some of the complexities that come up there.
[00:42:51] Unknown:
We try to make it really seamless, because, you know, different systems do have different capabilities. So file storage generally is good with regular metadata but not really extended metadata. Object storage is a lot more flexible in how much you can extend metadata, but it's not always consistent in how things are done. Every environment has different ways of implementing tags, for example, and they're not always really portable. And you don't have to worry about all those, because Komprise kinda handles it all for you, because we provide a consistent way to handle it across them.
The other thing that we can do is connect to other systems. So maybe you have a lot of metadata in your lab information management system, you know, because a lot of research environments, and genomics in particular, might have a LIMS where they've already put a lot of metadata into the LIMS system. The problem is that nobody outside the LIMS system can access that data. But in Komprise, we have, you know, ways where we can provide a unique identifier to the objects that we're keeping no matter where, and you can join that to the data in your LIMS system. So basically, we don't want islands of data. Our job is to bridge these islands.
No matter where the data or metadata lives, you can have a consistent way to look at it.
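The tag-portability problem she mentions is easy to see at the API level: S3 has first-class object tags, while a POSIX file system at best offers extended attributes. A small sketch of writing the same logical tags to both worlds, which is the kind of inconsistency a management layer has to paper over (the bucket, key, and path are hypothetical):

```python
import os

import boto3

def tag_s3_object(bucket: str, key: str, tags: dict) -> None:
    # Object storage: tags are a native, queryable concept
    boto3.client("s3").put_object_tagging(
        Bucket=bucket, Key=key,
        Tagging={"TagSet": [{"Key": k, "Value": v} for k, v in tags.items()]},
    )

def tag_local_file(path: str, tags: dict) -> None:
    # File storage: no tag API; extended attributes are the closest
    # analogue (Linux-only, and only if the filesystem supports xattrs)
    for k, v in tags.items():
        os.setxattr(path, f"user.{k}".encode(), v.encode())

tags = {"project": "sars-study", "region": "netherlands"}
# tag_s3_object("data-bucket", "genomics/run42.fastq", tags)
# tag_local_file("/mnt/nas/genomics/run42.fastq", tags)
```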
[00:44:14] Unknown:
In terms of the capabilities that Komprise offers, what are some of the use cases or specific workflows that are either often overlooked or not leveraged to their fullest extent that you think are worth calling out?
[00:44:30] Unknown:
Typically, customers bring us in first to just know what they have. You know, sadly, most enterprises are shooting in the dark when it comes to unstructured data. They don't know, like, why it's growing at the rate it's growing. They don't know what decisions are best for their data. So the first thing we provide is the analytics and planning ability, so you can plan what your next move should be. Then, you know, Cloud data tiering and Cloud data migration are typically very easy use cases to get started with. You know, as I mentioned on the data analytics side, things like, you know, legal hold, deletion of obsolete data, deletion of ex-employee data, you know, enabling research and departmental IT to do searches on data.
Those are some of the common follow on use cases that our customers use us for.
[00:45:22] Unknown:
In terms of the interesting or innovative or unexpected ways that you've seen Komprise used, I'm curious if there are any notable examples that come to mind.
[00:45:36] Unknown:
I think what I've been pleasantly surprised with is, you know, big organizations are reporting that they have saved 1, 000, 000 of dollars using Comprise. I mean, they've internally kind of looked at it. So, for example, I mean, this is a public case study we have with Pfizer. You know, actually, as they were developing the COVID vaccine, they used Comprise to manage the data across their data centers and AWS. And in the 1st year alone, they saved a few $1, 000, 000 in using the solution. And they're not alone in that. You know, I was talking to another pharmaceutical customer yesterday who said they saved $1, 000, 000 in the last 6 months using the product. And their CTO and everybody has visibility into it because of the sizable amount of savings we were able to generate.
And so that is gratifying because, I mean, that's what we expected. But to see that customers are able to recognize that value and they're able to see it, and also that their users are able to use the index. I think one thing that is gratifying to see is, in a lot of these environments, IT brings us in. But a lot of the pull to grow comes from the users, because they see the value of being able to search and find data and the fact that their data has been indexed for them. And so even though we didn't start with the users, they become really important proponents for us.
[00:47:00] Unknown:
In your experience of building and managing and growing the Komprise company and platform, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:47:12] Unknown:
Building a company is like raising children. I think every minute is unexpected, surprising, I would say. I mean, I'm very proud of our team, I would say. You know, I think the one thing we really learned from our last two companies is that it is very important, of course, to have the right solution at the right time in the market. I mean, those things go without saying. But I think what a lot of people underestimate is how important culture is in a startup organization and, you know, having like-minded people, not just that they're smart, but they're equally passionate and they get along and work well as a team, is extremely important.
And so we've been very conscious in how we've grown our organization. You know, it's a lot of fun. It's fun to build something together with a group of people who all want to make a difference. I would say that's been the most rewarding part of this whole journey.
[00:48:07] Unknown:
And so for people who are interested in being able to have this unified access layer and management layer for their file objects and unstructured data, what are the cases where Komprise is the wrong choice?
[00:48:21] Unknown:
We don't do anything with just block data. So if you only have block data in the organization, if all your data is inside a database, for example, we're not a good fit.
[00:48:33] Unknown:
And as you continue to build out the platform and the capabilities and work with new partners, what are some of the things you have planned for the near to medium term future of the product?
[00:48:43] Unknown:
So the big thing we are definitely seeing with our customers is more and more applications of that global file index. A lot of our customers are on a journey towards, you know, doing more machine learning or doing more automation overall, doing better analytics on data. I feel like, you know, a lot of the innovation on data analytics has so far been more on the data warehousing side. To be honest, I think most data lakes are kind of data swamps. There's too much junk in those data lakes. It's too hard to really figure out what you have. And now that a lot of people have implemented a data warehouse, they've gotten the structured analytics, you know, behind them. For the most part, they're starting to look at unstructured data, and they're starting to think about, well, how do we make a data lake more productive? How do we use these analytics on our unstructured data? And especially, how do we feed machine learning? And so that is a really exciting area of growth, and we're investing a lot in that. We're investing a lot in creating blueprints, example use cases, you know, maybe even Jupyter Notebooks, so it's easy to figure out how to do the execution on things. You know, we have APIs, and we're providing training around how to use them. So we're doing a lot around that, you know, enabling unstructured data analytics.
[00:50:07] Unknown:
To your point of working with data lakes, one of the other things I'm curious about is the types of analysis and aggregate information that you surface to end users to help them understand what they even have.
[00:50:22] Unknown:
It's funny. You know, it may sound almost elementary, but, you know, you just think about an average user who may have, like, a few hundred buckets of data. And even just a simple thing like listing all the files and finding something: if they had to manually go and list every single bucket, you know, one at a time, try to find things, you know, collect them, then try to figure out a way to manually move these different objects that might be in many different buckets into a new location. Every one of those tasks is tedious and time consuming and laborious and error prone. So at the first level, the first thing we're doing for these customers is we're giving them a single place where they can just go and type in something and get all their search results in one place, you know, a single place where they can set up policy. And based on those search results, Komprise copies it or moves it or does whatever they want with it. And then giving them analytics like, hey, how much data do you actually have? You know, how is it actually being used? You know, what kinds of data is it? You know, is your data lake full of video files, or are there a lot of files of another type? You know, who are the users that are using it the most? We're giving them all that kind of visibility, in addition to visibility on a per-object or per-file basis. I mean, it's like if somebody asked, what does Google search do for the web? I mean, how would you answer that? I feel that's what we're doing for all these unstructured data lakes. Are there any other aspects of the work that you're doing at Komprise or the overall space of managing and organizing
[00:51:59] Unknown:
and categorizing these unstructured data objects that we didn't discuss yet that you'd like to cover before we close out the show? You know, I would say that, you know, the biggest area where I think you might see more action
[00:52:12] Unknown:
is kind of this cross-platform way to not just search and mobilize data, but also to execute. Because, you know, if you have a Lambda function, there's no reason why it just can't be run on the data right there, you know, through an interface like ours. So the biggest thing I see is continued growth in data management as a way to search, mobilize, execute, and enrich data.
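That execute step can be pictured with the standard AWS SDK: given a set of keys that came back from a metadata search, fan a Lambda function out over them. The function name, bucket, and keys here are hypothetical stand-ins for real search results:

```python
import json

import boto3

lam = boto3.client("lambda")

search_hits = ["scans/0001.dcm", "scans/0002.dcm"]  # stand-in for index results

for key in search_hits:
    lam.invoke(
        FunctionName="classify-image",  # hypothetical function
        InvocationType="Event",         # async: fire and forget per object
        Payload=json.dumps({"bucket": "data-bucket", "key": key}),
    )
```

The interesting part is not the invocation itself but that the list of keys comes from one cross-platform index rather than from crawling each store by hand.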
[00:52:40] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:52:57] Unknown:
I think the biggest kind of mindset problem that the industry has, and I don't think it's coming from customers so much, is that the industry is mostly host-based. What I mean is, you know, everybody who's selling some infrastructure is trying to add some management in that infrastructure. But because it is based on that infrastructure, it's siloed. So I think the biggest change that is happening is looking at data management outside of infrastructure, looking at data management as a data problem, not an infrastructure problem.
[00:53:34] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing at Komprise. It's definitely a very interesting product working in an interesting problem space, and one that I'm excited to see grow and scale. So I appreciate all of the time and energy that you and your team have put into that, and I hope you enjoy the rest of your day. Thank you, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Interview with Krishna Subramanian: Introduction and Background
Understanding Unstructured Data
Komprise: Addressing Data Management Challenges
Target Users and Use Cases
Shortcomings of Storage-Centric Data Management
Complexities of Cloud Migration
Collaboration Between IT and Users
Cost Management and Data Duplication
Strategic Positioning and Partnerships
Technical Architecture of Komprise
Data Discovery and Heuristics
Use Cases and Workflows
Customer Success Stories
Lessons Learned and Company Culture
When Komprise is Not the Right Choice
Future Plans and Innovations
Data Lakes and Analytics
Future of Data Management
Conclusion and Contact Information