Summary
Unstructured data takes many forms in an organization. From a data engineering perspective, that often means things like JSON files, audio or video recordings, and images. Another category of unstructured data that every business deals with includes PDFs, Word documents, workstation backups, and countless other types of information. Aparavi was created to tame the sprawl of information across machines, datacenters, and clouds so that you can reduce the amount of duplicate data and save time and money on managing your data assets. In this episode Rod Christensen shares the story behind Aparavi and how you can use it to cut costs and gain value from the long tail of your unstructured data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Rod Christensen about Aparavi, a platform designed to find and unlock the value of data, no matter where it lives
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Aparavi is and the story behind it?
- Who are the target customers for Aparavi and how does that inform your product roadmap and messaging?
- What are some of the insights that you are able to provide about an organization’s data?
- Once you have generated those insights, what are some of the actions that they typically catalyze?
- What are the types of storage and data systems that you integrate with?
- Can you describe how the Aparavi platform is implemented?
- How do the trends in cloud storage and data systems influence the ways that you evolve the system?
- Can you describe a typical workflow for an organization using Aparavi?
- What are the mechanisms that you use for categorizing data assets?
- What are the interfaces that you provide for data owners and operators to provide heuristics to customize classification/cataloging of data?
- How can teams integrate with Aparavi to expose its insights to other tools for uses such as automation or data catalogs?
- What are the most interesting, innovative, or unexpected ways that you have seen Aparavi used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aparavi?
- When is Aparavi the wrong choice?
- What do you have planned for the future of Aparavi?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others.
Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's acryl. Your host is Tobias Macey. And today, I'm interviewing Rod Christensen about Aparavi, a platform designed to find and unlock the value of data no matter where it lives. So, Rod, can you start by introducing yourself? Yes. I'm Rod Christensen from Aparavi
[00:01:33] Unknown:
and CTO and founder.
[00:01:37] Unknown:
And do you remember how you first got started working in data?
[00:01:40] Unknown:
Oh my goodness. Do you want me to go all the way back? So I wrote my first backup program when I was 19, which was, now I'm gonna date myself, actually for CP/M, if any of you remember that. So that's a long time ago. I've been in the backup business for, what, about 30 years now, specialized in storage. And then the cloud was invented way, way back, and, you know, I had to adapt backup and storage technologies for the cloud. I've just basically been in the storage business for a long, long time.
[00:02:11] Unknown:
And so in terms of the Aparavi product, I'm wondering if you can just describe a bit about what it is that you're building there and some of the story behind how it came to be and why you decided that this was a problem that you wanted to spend your time and energy on.
[00:02:25] Unknown:
I've worked for a lot of different companies. I've worked for CI. I've worked for Usimni Technologies. I've worked for Symantec. I mean, a bunch of companies focusing on backup. And so I have a lot of experience in backup. But about 5 years ago, my cofounder Adrian and I were sitting there talking, and it's like, the way we were doing things in backup was just really not sustainable over the long term. The constant, you know, refrain of, you know, storage guys who actually end up managing storage is to just throw more disks at it. And, you know, at the end of the day, that's not sustainable. Eventually, we're gonna run out of, you know, magnetic material to put on these disks to store all this crap that we're creating. So, you know, we actually sat down with a blank piece of paper, and we re-architected and figured out what we could actually do that was different, unique. You know? And that's where the whole idea came from of bringing knowledge into it, and not just backing up bits, you know, and copying bits and throwing bits up into the cloud, but really to actually intelligently figure out what we had, and not do it in a manual fashion, which is also unscalable and unsustainable.
But really, you know, figure out and automate how we can do this and how we can manage storage
[00:03:37] Unknown:
in a very cost effective manner. As far as the types of organizations and customers that you're working with and the problems that you're focused on solving, I'm wondering if you can talk to some of the ways that that influenced your approach to the problem and the ways that you think about how you speak about the problem to them.
[00:03:57] Unknown:
When we designed it, we really didn't want to be a point product. We didn't want to write an ediscovery product. We did not want to write a, you know, data management product that just, you know, put bits around. Really, we wanted to write a product that could be of general use to allow the user to understand and know their data in a more intelligent fashion. So when we talk about data management, our first focus is actually, you know, understanding your data, making it visible. Not only just, you know, by file name or by type and by size and when it was created, etcetera, etcetera. It's mostly about what is this data? You know, what is the content of the data? What does the content of the data actually represent? And how do we find that out? So 1 of the things that we do is we pull the text out of all the different data files that we're looking at, managing, and controlling, read the text out of it, classify it, and index it so we can very easily search it. Then you can do really, really cool stuff with it. Like, you know, find all the files that have, you know, the word red and blue, or the files that have red near blue and green in it.
So that's just on search. But when you talk about classification, you know, you really need to boil your data down into the ROT, the redundant, you know, outdated, and trivial stuff, versus the really important stuff. And that's where the classification stuff comes in, or classification algorithms come in, to actually try and figure out, you know, what this data is. You know, is it a HIPAA document? Does it have patient codes or diagnostic codes in it? Or does it have, you know, Social Security numbers in it or bank account numbers? Or is it an SEC form? Or is it a tax form, etcetera? So that's really what we're focused on: knowing your data first. Now once you know your data, you know, once you understand the content of the data, whether it's, you know, 10 files or a billion files, really doesn't matter, then you can actually do really cool automation stuff with the content that we found. So, for example, take all documents that have personally identifiable information, PII, where we found a Social Security number, a phone number, an address, a name, a password, an account name, email address, anything like that. Once we find them, we can make sure that they're secured.
We can also make sure that if you want to, we can copy them up to a cloud. So once you really know the data and really become intelligent about what data you have, then you can do a lot of stuff with it. In terms of the types of classifications
[00:06:29] Unknown:
and insights that you're looking to provide to organizations, I'm wondering how you've thought about the useful categories and how much you want to be opinionated about it and how much you just want to provide a tool to let people form their own opinions.
[00:06:44] Unknown:
So Aparavi, our base classification system, comes with 160, I think it is now, 160 different classifications. And it'll classify a lot of things. Okay? It'll classify driver's licenses and passports and, you know, it even works with Bulgarian text and Chinese text. So it's a fairly general, you know, algorithm, and we're constantly updating rules on what a document should contain in order to be classified as a certain thing. However, that said, you know, 1 of the things we recognize is that we don't know everything. Right? So to force a user to actually use our predefined classifications is really unreasonable. So we have a drag and drop editor that allows people to set up classification rules. So for example, you know, you enter a patient number, a regex expression on a patient number, and then you can say, in this dialogue box, and it has a patient name, and it has a predefined, you know, diagnostic code, consider it a HIPAA document that needs to be, you know, preserved, so recognized and classified. So the thing is, you can build as many business centric classifications as you need to classify your data. You don't have to use our precanned 160 plus of them. They're really nice to use because they're very, very smart and, you know, understand document content, all the legal rules, and all that kind of stuff about, you know, what constitutes PII, you know, what constitutes a HIPAA document, etcetera. But, you know, you can have use cases where you need to recognize invoices, you know, a specific type of invoice, you know, for your business.
It's very easy to set up classification rules to recognize your specific invoice format.
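To make the idea concrete, here is a minimal sketch of a rule like the one Rod describes, written in plain Python. The regex patterns, the patient-number format, and the `classify` helper are illustrative assumptions, not Aparavi's rule editor or its built-in definitions.

```python
import re

# Hypothetical patterns illustrating the kind of custom rule described above;
# these are not Aparavi's built-in definitions.
PATIENT_NUMBER = re.compile(r"\bPT-\d{6}\b")                  # assumed patient-number format
DIAGNOSTIC_CODE = re.compile(r"\b[A-TV-Z]\d{2}\.\d{1,2}\b")   # ICD-10-like code, assumed
PATIENT_NAME = re.compile(r"\bpatient name:\s*\w+", re.IGNORECASE)

def classify(text: str) -> list[str]:
    """Return classification labels for a document's extracted text."""
    labels = []
    # All three signals present -> treat it as a HIPAA document to preserve.
    if PATIENT_NUMBER.search(text) and DIAGNOSTIC_CODE.search(text) and PATIENT_NAME.search(text):
        labels.append("HIPAA")
    return labels

print(classify("Patient Name: Jane Doe, ID PT-123456, diagnosis J45.9"))  # ['HIPAA']
```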
[00:08:25] Unknown:
As far as the applications of this technology, you mentioned that you came from the background and perspective of doing data backups for different applications. But, obviously, this technology also has implications for people who are working in analytics and trying to figure out what they have in their data lake, or maybe machine learning workflows to be able to understand what is in my training dataset. Wondering if you could just talk to some of the ways that customers who come to Aparavi think about the ways that you're solving the problems that they have, or maybe helping to identify problems that they didn't know that they have? That's a great question. You know, 1 of the things that we decided on very early on was that we were not going to try and be a better ediscovery
[00:09:11] Unknown:
product. And we were not trying to be a better, you know, file explorer. And we were not trying to be a better, you know, compliance tool. I mean, what we're really trying to do is enable those other products to be better. I'll give you an example here. You know, our product can be used for backup. Okay? We can copy stuff from point a to point b. Not a problem. But that's really not what we specialize in. What we're trying to do is make your backup product better. So, for example, if you're using Commvault or NetBackup or any of the other backup tools, you know, 1 of the things that you can do is eliminate redundant, you know, trivial data. So once you do that, once you start getting rid of all the underbrush that you don't really need anymore, that makes your backup window smaller, and it makes the, you know, data size that you have to copy every night or back up every night much smaller.
And that actually, you know, enables you to reduce costs on your backup program. Now let's talk about, you know, ediscovery. You know, given a 100,000,000 documents out there throughout your organization, how do you do ediscovery on that? Usually, that's a manual process. So you have to copy it up to, you know, some cloud resource and then, you know, run analytics and all that kind of stuff. But what if we could actually tell you that, you know, these are the 150,000 documents that may contain what you're actually looking for? Then your ediscovery tool only has to deal with 150,000 of them, not 100,000,000 of them. So this is the kind of, you know, approach we're taking. It's more of a generalized approach that you can, you know, use to make your existing products better. We're not replacing your other products. We're just trying to make them faster, more efficient, better products to work with.
[00:10:51] Unknown:
In that context of the sort of redundant or useless data, there was a period early in the, quote, unquote, big data era, particularly when Hadoop was starting to come on to the scene, where the common wisdom was to just collect everything. You never know what you're going to need. It might be useful someday. That has now, in a lot of cases, shifted, particularly in the context of regulations such as GDPR, to only collect the data that you know you're going to need right now, because, otherwise, you're wasting resources. And I'm wondering how you have seen organizations that have spanned that evolution looking at the datasets that they have and starting to understand which of those 2 approaches is actually going to give them the approach that they need for being able to produce value from the data that they're collecting.
[00:11:46] Unknown:
We're actually pretty agnostic on that. I mean, the thing is, we're not gonna dictate to you, you know, keep it all, which, by the way, is unsustainable. I mean, I think we've seen that that model just doesn't work, so we need to be much, much smarter about data. So that's where Aparavi really comes in. We can help you understand your data and only save the stuff that you need. For example, 1 of the use cases that we often get called on is migration. You know, a lot of companies wanna migrate stuff to the cloud, or they're doing a merger or acquisition. You know, a company that you're acquiring has, you know, 500 servers out there. What do you do? Just keep their data center intact and, you know, let half the company run on the old data center and the other half of the company on the new data center? No. That's just not workable. So it really is about identifying the stuff that actually needs to be brought over from the acquired company into the new data center, and we can help you intelligently do that by recognizing what you have and what's actually necessary. When you're migrating to the cloud, everybody thinks, okay, the cloud's a lot cheaper. We've all bought that line. But, you know, if you try and actually, you know, migrate your entire data center and all the crap that you have, you know, in your data center right now up to the cloud, that's not really saving you money. At the end of the day, that's actually gonna cost you more money than running your own data center. So, you know, why not migrate only the stuff you need? Only the stuff that's valuable to your organization. The data that has some value rather than, you know, the winning chili cook off recipe from 1989.
That's what we can help eliminate.
[00:13:20] Unknown:
So in some ways, this can be kind of viewed as an aspect of business intelligence where the intelligence that you're providing pertains to the data assets that an organization already owns or that they might be looking to merge with their existing assets in the case of that mergers and acquisitions use case. And 1 of the trends that has been growing in the business intelligence space for a while is the idea of actionable insights where being able to throw together a pretty chart is great, but it's useless unless it's actually informing you of what the next step is or helping you take that next step. And so in that lens, I'm curious what you see as some of those useful next steps that you're able to encourage or assist with from the insights that you're generating
[00:14:05] Unknown:
once somebody has adopted Aparavi and enumerated the different assets that they're working with? Yeah. A pretty chart is wonderful. Right? A nice pie chart in the middle of the screen telling you you're totally hosed because you have a lot of garbage in your system. Totally helpful. Yeah. That's really nice. But you're absolutely right. What do you do with this after we gain that intelligence? There are a lot of things that you can do with it. Once you know about it, we have a different part of the product called actions. Our product is called DIA, Data Intelligence and Actions. Right?
So once we identify and you know what you have, then we can perform those actions on it. The actions include copy. We can copy it to the cloud. We can do information life cycle management. So we can send it up to, you know, S3 first, then we can copy it over to Glacier after it ages out. You know, Azure has the same thing, so we can do it that way. The other thing is we can just remove it. It's shocking how many of these sysadmins actually say, okay, just delete all the crap that I have from 7 years ago. If it hasn't been modified or touched in 7 years, just get rid of it. I don't even wanna know what's in it. Right? So we can delete it. The other thing that we can do is secure it. We can make sure it's secure. So if you have a bunch of PII documents or directories with documents with, you know, SEC forms in it or something that's very sensitive to the company, we can automatically lock it down, make sure that nobody can access it except a set of user accounts.
Another example, working with an HR department: whenever somebody puts a bad word in an email or a Word document or something, we can actually notify HR on the spot that, okay, somebody used a bad word in this document. Now that's a little harsh, but, you know, you get the idea. So it's not only the knowledge, but it's the actions that we can take on that knowledge that we've gained to actually help you manage and automate the management of that data.
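As a rough illustration of how classification-driven actions like these might be expressed, here is a minimal Python sketch. The policy labels, the table structure, and the stand-in file operations are assumptions for illustration only, not Aparavi's action engine or configuration format.

```python
import shutil
from pathlib import Path

# Hypothetical, simplified policy table; the labels, structure, and targets are
# illustrative assumptions, not Aparavi's configuration format.
POLICIES = [
    {"match": "PII",       "action": "secure"},                          # lock permissions down
    {"match": "HIPAA",     "action": "copy", "target": "/mnt/archive"},  # stand-in for S3/Glacier tiering
    {"match": "stale-7yr", "action": "delete"},
]

def apply_policies(path: Path, classifications: set[str]) -> None:
    """Dispatch the action of every policy whose label matched this file."""
    for policy in POLICIES:
        if policy["match"] not in classifications:
            continue
        if policy["action"] == "secure":
            path.chmod(0o600)                                   # owner-only, as a crude example
        elif policy["action"] == "copy":
            shutil.copy2(path, Path(policy["target"]) / path.name)
        elif policy["action"] == "delete":
            path.unlink()
```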
[00:15:59] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP.
Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. In terms of the types of storage and data systems that you're working with, you mentioned things like S3 and object storage and, obviously, being able to integrate with some of the kind of corporate storage technologies like SANs and NFS and SMB shares. And I'm wondering how you also approach things like database engines to be able to understand, okay, these are the types of records that you're storing, or just what the overall breadth and scope is of the types of data assets that you are working with to help gain some of that visibility and intelligence about?
[00:17:40] Unknown:
You know, databases are a very interesting question. You know, when we first started, it was all about unstructured data, and that's really where we're focused: on unstructured data. There are a lot of database tools. I used to work for a company that actually produced erwin; I was in charge of the erwin product line, which gives you a huge amount of insight on databases and schemas and things like that. So databases are really cool, but there are a lot of tools out there that actually focus on databases at this point in time. So we try to stay away, you know, to a certain amount, from the database stuff. What we're really focused on is unstructured data. You know, files, emails, Teams messages, chat, Slack, you know, all these other things, all these different sources of data that are just completely, you know, let's be honest, it's absolute chaos out there when we look at that. So by focusing on that, we can really tackle a large part of the problem, you know, of data within an organization.
Will we ever have a database connector or database as a source? I don't know. I'm not against it. If we can add value out there, if we can do something completely unique that other tools do not, you know, joining databases and, you know, giving intelligence on databases, hey, I'm all in.
[00:18:48] Unknown:
In terms of the actual platform that you've built, I'm wondering if you can talk to some of the architecture and design elements that have gone into it and how the overall trends in storage technologies and system architectures have influenced the path that you've taken from when you first began designing for the problem to where you are today?
[00:19:10] Unknown:
Our architecture is actually pretty interesting because we're a SaaS product. We market it as a SaaS product. We are a SaaS product because you use a web browser to talk to the Aparavi, you know, platform. But that's about the only part of the SaaS that we are, because we actually need to access, you know, data within a data center or up in a user's cloud account or, you know, what have you. Wherever the data lives, we have to access it. So at the very top level, we have a 3 tier architecture. The very top level is our platform that's sitting up there, you know, as the SaaS application, and it can be considered just a huge router to the next level. And the next level is what we call the aggregator. The aggregator is sitting within a data center or within the customer's cloud, and it is the only 1 that has access to the data. The platform, we never store data on the platform. That was 1 of the security aspects that we designed into the product upfront.
Aparavi never has access to your data. It's always within the aggregator. It's always kept on-site. It's always kept close to the data. And the aggregator is responsible for running tasks, you know, examining data, you know, classifying data, and all that kind of stuff. Usually, that's a pretty big box. The third layer is really optional, and we call it the collector. The collector is basically run on a production server. You know, the production server has some local resource that needs to be, you know, managed. However, if it's an SMB share or, you know, a Teams or, you know, Office 365 email or something like that, the collector is not necessary. The aggregator does it, so we don't have to, you know, actually impact production at all. And that was the main tenet: we wanted to keep the burden off the production servers as much as possible. If we can, we go on a side channel, you know, through the aggregator.
As far as what's changing in the industry, okay, so keep in mind that I started way back in the CP/M days. So I can tell you all the changes that have occurred in my career, but the cloud really did change some things. And I think, you know, at the beginning of the cloud revolution, it was, oh my goodness, we gotta throw everything into the cloud. You know? That's gonna save us. And I think that we're now at a more fundamental place where we're understanding and creating hybrid data centers, whether they be part in cloud, you know, and part on prem. If you're a small business, you know, the cloud is a wonderful thing because you just go out and buy an Office 365 license for all your employees, and then they manage your email. Does anybody actually manage Exchange servers anymore, for goodness sakes? I hope not. I hope not too. Oh my goodness.
So there are certain things that the cloud is really, really good at, but there's a lot of stuff out there still that will never be put into the cloud. I worked for a company, you know, where the IT department absolutely said, yes, you can run stuff in the cloud. We can set up cloud accounts. But our accounting systems are not and never will be placed into the cloud. That's just not gonna happen. So, I mean, the reality is, today we have hybrids of, you know, local resources and remote resources in the cloud. And now what we're also seeing is multi cloud too. Some are going into Azure, you know, for what Azure does really well, and some people, you know, have some of their resources in AWS to support what AWS does really well. So when we talk about hybrids, it is a different monster, because now you have to actually architect your product to be, number 1, cloud agnostic, but also, you know, on prem agnostic as well. And if you don't have the right architecture to actually, you know, perform that functionality upfront, you're kind of really hosed. It doesn't really work.
[00:22:44] Unknown:
And so for an organization that is adopting Aparavi, I'm wondering if you can just talk through what that initial integration and onboarding process looks like. And then once they have it up and running, some of the typical workflows where they might interact with Aparavi and how it sits into the overall kind of flow of a, you know, day to day effort for the different sort of user personas that will interact with it most frequently.
[00:23:10] Unknown:
The first real step that we try and do, if a company is gonna go with Aparavi or is looking at Aparavi, is a proof of concept. You know, the thing is, we take small chunks of the problem, because, you know, when we look at the overall problems that most corporations have, most companies have, it's absolutely daunting. So let's just take a little piece of it. You know, set up your rules, set up your classifications, set up your policies, and just start working on that. And then we start expanding things. 1 of the first things is we do what's called a quick scan. And the quick scan, essentially, you know, does a very quick, you know, metadata search on, you know, what type of file it is and, you know, when it was created, its size, who accessed it last, its permissions, who's the owner, you know, some very basic information that we can get fairly quickly.
You know, once you understand that, then you can actually start looking at, okay, I need to concentrate on JPEG files, or I need to concentrate on docx's, or, you know, I have some huge video files that I need to, you know, examine. But then once you start going further, then you can turn on indexing and classification. Indexing and classification, because of the, you know, data that we actually have to bring in and parse, you know, and possibly do OCR, optical character recognition, on as well, can be quite a time consuming process. So that's why we try and do the quick scan upfront to get an idea of, you know, the problem set that you have. Then we start developing your policies, you know, and your actions for what we find with that, and turn on the index. And once you have knowledge about it, then you can start creating your action policies, like, okay, anything older than 7 years, you know, just get rid of. Right? That's a pretty typical, you know, policy that's put into place. Well, once that's set, you don't ever have to look at it again. Rest assured that every time the policies are kicked off, anything over 7 years will get deleted.
Or if you want to send it to Glacier before you delete it, that's an option too. So send it to Glacier. Once it's up in Glacier, then we'll go ahead and delete it. Once the policies of, you know, the corporate data actions are set up, then it's kinda set and forget. You don't have to worry about it. You just monitor it and, you know, spit out reports and say, okay, well, I removed this because of this, and I copied this up here because of this. And, you know, so then you just monitor it. So it's not something where you have to go in and say, oh my goodness, do I have to switch a tape? It's not that kind of system. It just happens. It's automated. Did that answer your question, Tobias?
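A metadata-only quick scan like the one described can be approximated with nothing more than the standard library. The sketch below is an assumption-laden stand-in (the paths, the 7-year cutoff, and the report fields are illustrative), not Aparavi's scanner.

```python
import time
from pathlib import Path

SEVEN_YEARS = 7 * 365 * 24 * 3600  # rough cutoff in seconds

def quick_scan(root: str) -> list[dict]:
    """Cheap metadata-only pass: type, size, owner id, and staleness per file."""
    report = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        st = path.stat()
        report.append({
            "path": str(path),
            "ext": path.suffix.lower(),
            "size": st.st_size,
            "owner": st.st_uid,  # always 0 on Windows
            "stale": (time.time() - st.st_mtime) > SEVEN_YEARS,
        })
    return report

stale = [r for r in quick_scan("/data/shared") if r["stale"]]
print(f"{len(stale)} files untouched for 7+ years")
```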
[00:25:37] Unknown:
Yeah. Definitely did. And to the point of being able to put things on autopilot, it's definitely always helpful to be able to say, okay, I can trust that this is going to happen. I don't have to, you know, maintain that cognitive burden of remembering to, you know, every year go and make sure that this particular set of files gets cleaned up or that I go and archive this set of files after a particular point in time. You know, that's the whole point of computers: they do the things that are routine and rote so that I don't have to remember to do them. And yet somehow we still end up
[00:26:06] Unknown:
in that space of having to say, okay, well, the computer makes things easier, but now it just made different work for me to do that I have to keep remembering to do every year. Right. Right. Yeah. So the set and forget is really, really important. I mean, if you want to do an audit on it and find out why it deleted something, you can always call it up manually, you know, through the user interface and say, you know, this file is gone, why did you get rid of it? And we'll tell you why we got rid of it, you know, because it matched this policy and so on and so forth. But, you know, what we're trying to do is actually make IT's job easier by automating this stuff, reducing the data footprint that you no longer need,
[00:26:42] Unknown:
not adding to your burden of actually having to hire 5 different people to monitor this thing every night and say, oh my goodness. You know, we need to kick this off and do this and this and this. It really is set and forget. Absolutely. Going back to the question of categorization, you mentioned that you have a number of different utilities for being able to say based on these different heuristics, this is the type of file that it is, or this is the type of information that it contains. And I'm wondering if you can just talk to some of the types of heuristics that have been most adaptable and some of the ways that you've had to go in and kind of customize for particular use cases or the utilities that you provide for organizations to be able to build their own rule sets to say, anytime I see, you know, this sequence of characters, it actually pertains to a particular, you know, product identifier within our inventory catalog, or, you know, it pertains to something that is specific to our organization or our business line and sort of what that process looks like for understanding what the balance is between we'll just use everything out of the box, and these are the things that we really need to make sure that we get right because it is kind of critical to our business to have that information about which assets have this information.
[00:27:56] Unknown:
Let's talk about out of the box classification first because it's really cool and really interesting how this whole thing works. So,
[00:28:05] Unknown:
a lot of this stuff is pattern matching and regular expressions and, you know, stuff like that. But if you're just looking at patterns, that's really not enough. Because if you look at the way a Social Security number, for example, can actually be represented in a document, there are several different formats. Now when we talk about dates and times, there are lots of different formats for dates and times. Right? So 1 of the things we do is, we don't use just a simple regex pattern matcher. We actually try and understand what the numeric value is in the context of where it's at. Now let's say that we have what looks like a Social Security number, which has 3 digits followed by a dash, followed by 2 digits, a dash, and 4 digits.
We are, let's say, 50% confident that that is a Social Security number. Right? So then we go 1 step further, and we look for the word social or the word security around it, within it, near it, and see if we can find that. Then we boost that confidence level up and say, okay, now we're 70%, you know, confident that this particular number is a Social Security number. If we see Social and Security number, then we may boost it up to 90%. If we see SSN, we'll boost it up to 90%. So it is really, really intelligent on trying to figure out what's in there in the context of the whole, not just, you know, simple pattern matching. So when we look at that, that's 1 rule. That's just matching the Social Security number. But let's say that we wanna classify something as a patient document. Right? It may have, you know, xxx-xx- and the last 4 digits of the Social Security number. We can actually match that pattern. But, you know, that's fine. We're 90% confident that's a Social Security number.
But then if we want to classify that as a patient, you know, maybe a patient document or a patient communication, we can actually match a patient number as well. And this is where the custom rules really come in, you know, very nicely, because we can say, you know, if we're 70% confident that this is a Social Security number, 90% confident that this is a patient number, and, let's say, you know, it starts off within the first 100 words of the document saying hello, you know, followed by a colon, you know, saying, hello, mister patient, colon, then technically we're maybe 90%, 95% sure that this is a patient communication, so we flag it, we, you know, classify it as a patient record. Then you can deal with that individually.
Then you can say, okay, copy all patient communications up to S3, long term archive up to Glacier. That's the kind of operations that we can perform on that, but it really is key to allow a user to specify these classifications and build classifications that support the business, rather than just our predefined rules that we view as important.
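A toy version of the confidence-boosting idea, purely for illustration: the weights, keywords, and function below are made up and are not Aparavi's classification engine.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def ssn_confidence(text: str) -> float:
    """Score how likely a matched pattern really is a Social Security number,
    boosting confidence when nearby context keywords agree. Weights are invented."""
    if not SSN_PATTERN.search(text):
        return 0.0
    confidence = 0.5                      # bare pattern match only
    lowered = text.lower()
    if "social" in lowered or "security" in lowered:
        confidence = 0.7                  # a supporting keyword nearby
    if "social security number" in lowered or "ssn" in lowered:
        confidence = 0.9                  # explicit label
    return confidence

print(ssn_confidence("SSN: 123-45-6789"))      # 0.9
print(ssn_confidence("Order 123-45-6789"))     # 0.5
```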
[00:30:58] Unknown:
Going back to that aspect of being able to put a lot of this file and data management on autopilot is the question of reliability and redundancy and some of the ways that you have had to engineer around various edge conditions and failures and automatic retries, because of the fact that you are spanning so many different systems and you want to make sure that when I want to copy this file into, you know, an off-site backup, or when I want to replicate this file, that it actually is, you know, being reproduced bit for bit and I'm not introducing corruption. And just some of the kind of vagaries that go into data management and file management, particularly when you're working at large scales, where, you know, small corruptions here and there that are infrequent, you know, from a statistical perspective become an inevitability at a certain scale. I'm curious about some of the ways that you've had to work around and engineer around that?
[00:31:53] Unknown:
1 of the things that we do is, before we actually do anything with the file, we always run a SHA-512 signature on it, and we store that. So when we get a file back from the cloud, or we send the file to the cloud or something like that, we compute the SHA-512 signature and make sure that it actually, you know, did what it was supposed to do. In addition to that, we use the standard cloud CRCs, you know, for assured transfer, you know, and reliability of the information. But, you know, 1 of the things that we found in moving things around is you have to think about all the holes that data can fall through before it gets to its, you know, final destination.
And that's where we do the redundant things, like, we never delete a file until, you know, we're absolutely sure we committed the file up to the cloud. You know, you can actually run a verification stage on it, which unfortunately has to download the file again. But, you know, at that point in time, you know, if you want that level of security and functionality, you can do that. So it really is up to the user on how much assuredness they want. You know, for really important stuff, absolutely, you should do that. No question about it. But, you know, for stuff that you're just sending up to Glacier because you don't need it anymore, and you have no place else to put it, and you're scared to death to delete it, so you need to put it somewhere, it's okay. You know, if Amazon tells us it, you know, got proper CRCs and it ended up correctly and it received it correctly and it acknowledges that, then we'll go ahead and delete. So it really depends on, you know, the type of data. And this is all under policy. So, you know, for the important stuff you can set a very, very high level of security. For the other stuff, you know, the low-priority stuff, you can say, okay, well, just as long as I made a copy of it, that's okay.
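For readers who want to see the verification idea in code, here is a small Python sketch that streams a file through SHA-512 and only reports a file as safe to delete when the digest recorded before upload, the local file, and a round-tripped copy all agree. The function names and workflow are assumptions, not Aparavi's implementation.

```python
import hashlib
from pathlib import Path

def sha512_of(path: Path) -> str:
    """Stream a file through SHA-512 so large files never sit fully in memory."""
    digest = hashlib.sha512()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def safe_to_delete(local: Path, round_tripped: Path, recorded_digest: str) -> bool:
    """Treat the local copy as deletable only when the digest recorded before
    upload, the local file, and the re-downloaded copy all agree."""
    return sha512_of(round_tripped) == recorded_digest == sha512_of(local)
```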
[00:33:34] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t shirt for being a listener of the data engineering podcast at dataengineeringpodcast.com/rudder. As far as integrating across the overall kind of data platforms and data use cases in an organization, what are some of the interfaces and extension points that you've built into Aparavi for being able to populate things like data catalogs, or being able to, you know, make sure that certain unstructured data assets are being loaded into the staging location for a data lake, or, you know, making sure, as you identify assets that fit into a particular category, they may be loaded into a training dataset for machine learning? From Aparavi's point of
[00:34:47] Unknown:
view, it's really interesting how we architected this, because we really don't care what the source of data is and what the target of the data is. So it's very simple to actually write a custom connector that will connect to your data lake, you know, and your particular aspect. And by the way, it's all in Python. So I love Python. You know, it's very simple to actually integrate the product into existing processes and pipelines. So for example, if you wanted to take an action, you know, and send it over to, you know, a target, which happens to be some code in Python that sends it up to your data lake and does manipulation of that, that's cool. No problem. You know, write a Python script, and we'll send it over to it, all the data that you need, and you can do with it what you want. Same thing on input. On input, if you have some special, you know, IoT data or something that you want to catalog and keep and manage and all that kind of stuff, that's very simple. You can write a Python script to, you know, import it into the system, and we'll keep importing it and sending it through just like it was a data file. So the whole concept is the abstraction of data. You know, data no longer just lives in files. It lives statically. It also lives dynamically.
So, you know, we have to account for that, you know, how the processing of this data works and the pipelines that we put it through. Did that answer your question, Tobias? Yes.
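As a hypothetical example of what such a custom Python target could look like, the sketch below lands selected documents plus a metadata sidecar in a data lake staging directory. The class shape, method names, and paths are invented for illustration and are not Aparavi's plugin API.

```python
import json
from pathlib import Path

class DataLakeTarget:
    """Receive documents selected by a policy and land them in a staging area."""

    def __init__(self, staging_dir: str):
        self.staging = Path(staging_dir)
        self.staging.mkdir(parents=True, exist_ok=True)

    def write(self, doc_id: str, payload: bytes, metadata: dict) -> None:
        # Store the raw bytes plus a metadata sidecar so downstream jobs
        # (catalog loaders, ML training prep) can pick them up.
        (self.staging / f"{doc_id}.bin").write_bytes(payload)
        (self.staging / f"{doc_id}.json").write_text(json.dumps(metadata))

target = DataLakeTarget("/lake/staging/unstructured")
target.write("doc-0001", b"...extracted text...", {"classifications": ["PII"]})
```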
[00:36:13] Unknown:
Python is the ruler of all things integration. 1 of my favorite quotes is that Python is not the best language at anything, but it's the 2nd best at everything.
[00:36:22] Unknown:
Yeah. I love Python. You know, as a matter of fact, our engine that does all the data movement has the embedded Python interpreter in it. So it runs Python scripts natively, so you don't even have to have Python installed. It has all the libraries that you need to, you know, natively support the Python scripts that you write. If you love Python, you'd love it. Absolutely.
[00:36:45] Unknown:
And so for people who have been using Aparavi, and as you have been growing the business and working with customers, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:36:57] Unknown:
Okay. So this is a really strange use case, but we have an architectural firm. Right? And this architectural firm has thousands of drawings, just literally thousands. And they couldn't identify what the drawings were for. So they brought us in to actually do OCR on all these drawings and classify them. Now since we can actually read the text off the drawings, they can do very interesting things with those drawings, like find all drawings with a living room in it. Right? Because we recognize living room. But it was really taking this huge amount of data that was completely unstructured, because, you know, it was not in any kind of format. The file names didn't, you know, match up with what they actually were. By doing the OCR on these drawings, we could actually say, okay, this is for customer number 1, this is for customer number 2, and so on and so forth, and help them completely organize these thousands of drawings that they had, you know, across the organization. So that was actually really surprising. I guess I was surprised at how screwed up it was in the first place. Why would anybody do that? Right? Why would you name a file something that you couldn't actually figure out what it was without opening it up? But, you know, long story there, but it happens. So that was 1 of the interesting use cases that I found, and 1 that I actually enjoyed working on. So Yeah. To the point of bad file naming strategies, it also brings up the classical
[00:38:21] Unknown:
example of people who replicate Git by just copying the same file multiple times rather than using a tool that's actually supposed to manage that workflow.
[00:38:30] Unknown:
Oh, yeah. Oh, and by the way, that's 1 of the reasons why we built the SHA signature in there, the SHA-512 signature: then we can identify duplicate copies of the data across the organization too. So if you wanna know how many of those chili cook off recipes from 1987 you have, we can tell you that.
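The same content hashes make duplicate detection straightforward. A minimal sketch, assuming plain files on a local path rather than Aparavi's index:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root: str) -> dict[str, list[Path]]:
    """Group files by the SHA-512 of their contents; any group with more than
    one entry is a set of byte-identical copies."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_digest[hashlib.sha512(path.read_bytes()).hexdigest()].append(path)
    return {digest: paths for digest, paths in by_digest.items() if len(paths) > 1}

for digest, copies in find_duplicates("/data/shared").items():
    print(f"{len(copies)} identical copies: {[str(p) for p in copies]}")
```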
[00:38:51] Unknown:
Yeah. Well, everybody loves chili. You gotta make sure you have it backed up in multiple places. You know what? It was the company chili cook off, and it probably wasn't very good anyway. So And in your experience of building the Aparavi product and digging deeper into this problem space and helping organizations gain more insight and value from the data that they have, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:39:18] Unknown:
1 of the surprising findings that I've had is the amount of absolute garbage that people actually store. Well, okay, I will admit that on my laptop, I do have that chili cook off recipe from 1987. Whenever I get a new laptop, I just say, okay, copy my whole documents folder over and just move it forward. So I have so much crap out there. It's amazing. But, you know, at a corporate level, that really doesn't scale. And I'm really surprised at the size of that data and the percentage of that data that is just absolutely trash. And then you start computing, okay, how much is this costing me on my Pure array, right, to store, you know, 50% of the stuff that I will never use again?
It surprised me. I figured, okay, maybe 10, 20%. But it's up to 50%. It's shocking how much garbage people actually have. And the problem is, you have a 100,000 files, you know, in this directory structure. Which is cheaper? Are you going to pay 5 people to go through and see what you actually need to keep? Or are you just gonna go out and buy another, you know, terabyte of disk storage? It's much cheaper to go out and buy another terabyte of disk storage. Right? That is precisely why the problem has gotten as bad as it has. But, you know, if we look at the future, it's really, really not gonna sustain itself. Absolutely.
[00:40:38] Unknown:
And you're also running up against the kind of emotional, visceral reaction to deleting things of, like, well, I don't know, I might actually need that. How do you approach some of that kind of resistance and pushback from customers when you say, actually, you can delete these 10,000 files from your SAN array because nobody cares about them and they never will.
[00:40:58] Unknown:
Yeah. Well, you know what? Take it off your very expensive SAN array. We will send it to Glacier for you. So in case you do need it, you know, it's available. But you know what? We're gonna pay a fraction of a cent per gigabyte on that. We're gonna have significant savings on that. So just for your peace of mind, we'll go ahead and throw it up there. That's a very big compromise. I would say just get rid of it. But, you know, the compromise is, send it up there. Nobody's ever gonna use it anyway. And even if it takes you 8 hours to get it back, which is what they quote, so be it. At least it's there.
[00:41:31] Unknown:
Talking about this, you know, to your comment about the documents folder that you've been propagating across laptops, it makes me think that I should, you know, use Aparavi to scan my network drive and see how much garbage I've got on there that needs deleting that I just haven't taken the time to clear out.
[00:41:44] Unknown:
You would be shocked. Yeah.
[00:41:48] Unknown:
So for people who are interested in being able to gain those insights into the data that they're storing and understand what they can do to clean it up and organize it and categorize it and, you know, delete the things that nobody cares about, like that chili cook off recipe, what are the cases where Aparavi might be the wrong choice?
[00:42:08] Unknown:
You stumped me on that. I gotta be honest. That's the first time anybody's ever asked me that question. When you're dealing with unstructured data, it really isn't a bad choice. And the reason why is because it's general. It really helps you know your data, regardless of what kind of use case you have or, you know, what you're trying to do with the data. Before you can actually do anything with your data, the first step is actually getting to know your data. You have to know what you have before you can address any kinds of problems with it. So as far as where Aparavi does not fit, I'm not sure that I could come up with a use case where it doesn't really fit. Is there a particular scale where it makes more sense to adopt Aparavi, and below that, you know, it's actually just quicker and easier to do it manually? Or I think if you're just scanning your own laptop, okay, so maybe it would be negligible on, you know, reducing redundant and trivial stuff that's on my laptop, because I'm not actually gonna delete anything off my laptop, because what if I need it 6 years from now?
So 1 of the things is, let's talk about scale. How does Aparavi scale? You know, when we talk about OCR and we talk about reading documents and stuff like that, and let's say you throw a 1,000,000,000 of them at it, you know, 1 of the things we recognized very early on is that scale is a real problem. You know, if you try and get a single machine to actually do this, you know, to classify and read all these documents, you know, maybe 3 years from now, we'll actually get it done, you know, if it was a single machine. So 1 of the things that we have is workers that distribute, you know, the load of all this processing of data onto, you know, different workers. So we actually can, you know, classify and index many, many, many documents concurrently, depending on how much resources you wanna throw at it. So when you talk about an upper limit, I don't really think there's an upper limit. I mean, you know, what we've looked at and what we're testing against, we're constantly, you know, increasing our test lab, you know, to make sure that we can scale up to the places that we want.
But if you came to me and said, okay, I have a 1,000 petabytes of data that I need to scan, I'm gonna say, okay, throw a 1,000 machines at it, you know, to do the indexing and classification, and we'll get it done in 30 days. That's the kind of thing that we actually, you know, have built into the product. So I would worry about that. You know? And the reason why I would be worried about it is that if you try and go off and do that, you know, yourself, you know, your policies may not be right. And you should probably consult with, you know, 1 of our systems architects at Aparavi to make sure that, you know, before you invest a month's worth of, you know, a 1,000 machines' computing time, we get it right upfront. So that would be my only worry on that. But, yeah, it's really designed to go from the very, very low level, you know, branch office levels, branch office, you know, installations, home installations.
It can actually pick up the CEO's laptop and the CFO's laptop and make sure they're complying with regulations and stuff like that. So it really is versatile. I hope that answers the question. Yeah. No. It's definitely useful.
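The fan-out pattern Rod describes, distributing expensive per-document work across many workers, can be sketched with Python's standard library. The classification logic below is a trivial placeholder and the worker count is an assumption; this is an illustration of the general idea, not Aparavi's scheduler.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def classify_file(path: str) -> tuple[str, list[str]]:
    """Placeholder for the expensive per-file work (extraction, OCR, classification)."""
    text = Path(path).read_text(errors="ignore")
    return path, (["PII"] if "ssn" in text.lower() else [])

def classify_corpus(paths: list[str], workers: int = 8) -> dict[str, list[str]]:
    """Fan per-document work out across worker processes, the same idea as
    throwing more machines at a billion-document backlog."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(classify_file, paths))
```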
[00:45:06] Unknown:
And so as you continue to build and iterate on the Aparavi platform and tool chain and work with your customers, what are some of the things you have planned for the near to medium term or any problem spaces that you're excited to dig into?
[00:45:19] Unknown:
You know what? What I'm working on right now as CTO, I love this field, and that's machine learning, and specifically things around, you know, speech to text and object recognition. 1 of the other things that's really exciting is, when you're given a document, yes, you can parse the text out of it. You can look at the regexes and pattern matching and, you know, you have a confidence level and all that kind of stuff. But it's really not telling me what that document is. Right? Is it a lease agreement? Is it a letter to my kid? Is it, you know, a specification, a design document, or something like that? You know, pattern matching doesn't really tell us that. That's 1 of the things we're really doing research on in machine learning: really trying to understand what a document is. Why does it live? That's really, you know, to me, the holy grail of what we're trying to do, and we're spending a lot of money trying to figure that out, crack that nut. And another thing that really is exciting to me is, okay, so this is internal, you can't tell anybody, it's called intelligent numeric processing.
And intelligent numeric processing is really an interesting idea because, you know, if you go into Google and just, you know, do a search on 10k versus 10000 versus, you know, 10,000, you get 3 entirely different result sets. Okay? Well, the question is why. When you have a document, I wanna be able to say, okay, give me all contracts over $10,000. Right? Or give me all contracts between $10,000 and $50,000. Regardless of whether the user actually wrote them as, you know, $10k or, you know, $10,000 or whatever. So that's where we're putting in some research and time, and it really excites me. Because once we do that, then we can do things like intelligent processing on dates, on units of measure, and things like that. So if you have, you know, for example, European subsidiaries that are dealing in, you know, euros versus US subsidiaries that are dealing in dollars, we can normalize that searching and, you know, present a unified view of the documents for search and, you know, discovery.
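A toy normalizer for the "$10k versus $10,000" problem might look like the following; the supported formats and the regex are assumptions for illustration only, not the intelligent numeric processing feature itself.

```python
import re

# Toy normalizer; the supported forms and the regex are illustrative assumptions.
_AMOUNT = re.compile(r"\$?\s*([\d,]+(?:\.\d+)?)\s*([kKmM]?)")

def normalize_amount(token: str) -> float:
    """Turn '$10k', '10,000', or '10000' into the same numeric value."""
    match = _AMOUNT.fullmatch(token.strip())
    if not match:
        raise ValueError(f"unrecognized amount: {token!r}")
    value = float(match.group(1).replace(",", ""))
    scale = {"": 1, "k": 1_000, "m": 1_000_000}[match.group(2).lower()]
    return value * scale

assert normalize_amount("$10k") == normalize_amount("10,000") == normalize_amount("10000")
```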
[00:47:27] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:47:43] Unknown:
That's another tough question. I don't think it's a deficiency. I think it's actually by design, but the design makes it so easy to copy data all over the place, and it just replicates like, you know, rabbits. And so I think a lot needs to be done to, you know, reduce the number of copies that we can so easily make.
[00:48:03] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Aparavi. It's definitely a very interesting product and an important problem space. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Very good. Thank you so much, Tobias. Thank you, everyone. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to the Episode
Interview with Rod Christensen
The Story Behind Aparavi
Understanding and Managing Data
Applications and Use Cases of Aparavi
Actionable Insights from Data
Integrating with Storage and Data Systems
Aparavi's Architecture and Design
Onboarding and Workflows
Classification and Heuristics
Reliability and Redundancy
Integration Points and Extensions
Interesting Use Cases
Lessons Learned
Future Plans and Machine Learning
Biggest Gaps in Data Management