Summary
The vast majority of data tools and platforms that you hear about are designed for working with structured, text-based data. What do you do when you need to manage unstructured information, or build a computer vision model? Activeloop was created for exactly that purpose. In this episode Davit Buniatyan, founder and CEO of Activeloop, explains why he is spending his time and energy on building a platform to simplify the work of getting your unstructured data ready for machine learning. He discusses the inefficiencies that teams run into from having to reprocess data multiple times, his work on the open source Hub library to solve this problem for everyone, and his thoughts on the vast potential that exists for using computer vision to solve hard and meaningful problems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
- Your host is Tobias Macey and today I’m interviewing Davit Buniatyan about Activeloop, a platform for hosting and delivering datasets optimized for machine learning
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Activeloop is and the story behind it?
- How does the form and function of data storage introduce friction in the development and deployment of machine learning projects?
- How does the work that you are doing at Activeloop compare to vector databases such as Pinecone?
- You have a focus on image-oriented data and computer vision projects. How do the specific applications of ML/DL influence the format and interactions with the data?
- Can you describe how the Activeloop platform is architected?
- How have the design and goals of the system changed or evolved since you began working on it?
- What are the feature and performance tradeoffs between self-managed storage locations (e.g. S3, GCS) and the Activeloop platform?
- What is the process for sourcing, processing, and storing data to be used by Hub/Activeloop?
- Many data assets are useful across ML/DL and analytical purposes. What are the considerations for managing the lifecycle of data between Activeloop/Hub and a data lake/warehouse?
- What do you see as the opportunity and effort to generalize Hub and Activeloop to support arbitrary ML frameworks/languages?
- What are the most interesting, innovative, or unexpected ways that you have seen Activeloop and Hub used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Activeloop?
- When is Hub/Activeloop the wrong choice?
- What do you have planned for the future of Activeloop?
Contact Info
- @DBuniatyan on Twitter
- davidbuniat on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Activeloop
- Princeton University
- ImageNet
- Tensorflow
- PyTorch
- Activeloop Hub
- Delta Lake
- Tensor
- Wasabi
- Ray/Anyscale
- Humans In The Loop podcast
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Have you ever had to develop ad hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps platform that streamlines data access and security. Satori's DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift, and SQL Server, and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access.
Go to dataengineeringpodcast.com/satori, that's S-A-T-O-R-I, today and get a $5,000 credit for your next Satori subscription. You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there's a book that captures the foundational lessons and principles that underlie everything that you hear about here. I'm happy to announce that I collected wisdom from the community to help you in your journey as a data engineer and worked with O'Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy.
Your host is Tobias Macey. And today, I'm interviewing Davit Buniatyan about Activeloop, a platform for hosting and delivering datasets optimized for machine learning. So, Davit, can you start by introducing yourself?
[00:02:23] Unknown:
Hey, Tobias. Thanks for having me here. My name is Davit. I'm the founder of Activeloop. Before that, I was doing a PhD at Princeton University, where I was dealing with really large datasets at petabyte scale. We had this process of running machine learning and deep learning on these huge datasets to reconstruct the connectivity of neurons, and we had to optimize and drill down on a lot of stuff, and just creating datasets and maintaining them is a big challenge. And that's what motivated and helped me to start the company 3 years ago. Do you remember how you first got involved in data management? Yes. So I had a task to train a deep learning model.
And you'd think it's very easy to get started. The basics, like, you get started with ImageNet. You have to train this classification model, try to get to the same results as, let's say, you know, AlexNet. The problem with that is that I spent, like, 2 or 3 weeks just unzipping all the ImageNet files, getting access to it, wrapping it up, and then getting the format and the data loaders correctly done so I could plug this into TensorFlow or PyTorch to train a model. And, like, that felt slightly demotivating, because I thought that I was starting a PhD to train models, but, actually, I spent most of my time on doing data engineering work. And things get worse when, actually, the data is not 100 GB. It's actually 100 TB in
[00:03:39] Unknown:
size. And the time is, like, mostly spent on actually looking into the dataset itself rather than trying to innovate on the model side. That brings us a bit to where you are now with Activeloop. And you mentioned that you started it about 3 years ago. So I'm wondering if you can just give a bit of an overview about what it is that you've built there and some of the story behind your decision to build a business around this, to tackle this problem, and what it is about this particular problem space that is motivating you to dedicate so much of your time and effort to it? As I mentioned before, when I was at the neuroscience lab, we were looking into the training aspect of the models, focusing on models, getting them deployed
[00:04:14] Unknown:
efficiently, and so on. But the challenge we had is that just processing a petabyte-scale dataset on AWS was costing us $1,000,000. And our goal as data scientists and computer scientists in the neuroscience lab was to reduce this cost 2 or 3 times. And to be able to do that, we had to rethink how the data should be stored on the storage, how it should be moved from the storage to the computing machines, whether we should use CPUs or GPUs, and what kind of models to use. And these kinds of insights helped us to think about how we can actually make this process much faster and more efficient, and, more importantly, have data scientists spend less time on the data itself, so that all this time that is kind of wasted can be reused for more important stuff, and automate this process. And that's how we got into Y Combinator 3 years ago. We raised our seed round, and we started working with early customers. One of the customers we worked with had 80,000,000 text documents, and the problem was to train an embedding model, take this text data, put it into vectors so they can run a very efficient search later. The problem they had as a data science and data engineering team was that the training itself was taking 2 months. And we helped them reduce the training time to about a week, cut compute and storage costs, and, like, use a much more complicated model that resulted in much better results. Another customer of ours, IntelinAir, had flights over the fields in Illinois to collect a lot of aerial images to be able to provide insights to the farmers. The problem they had is that they had to identify where there's disease in the field or a dried-down area.
And they collect, like, petabyte-scale data and put it on AWS S3, and they have to train all these models. And we helped them build all the data pipelines going from the raw data to the models. And what we learned so far is that, actually, you know what? The models are not the problem. Like, you can get pretty good results. Like, you can get up to, like, 95%-plus accuracy on the challenging problems that actually solve a real problem today in, I don't know, NLP or aerial imaging or computer vision. But getting to that stage where you have this cycle of dataset creation and model training clicking together, like, into an organism where not only the models evolve, but also your datasets, that's a big challenge for the companies. And even if you go on, like, Google search today, like, give me a database for images, let's say, what you find on Stack Overflow, they all say, hey, like, store all your data or metadata in a database and then point files or, like, proxies to the blobs where the data is stored. And that's where you think, okay, there should be a database, or at least, like, kind of a data store, for how this data should be stored so it very efficiently plugs into machine learning use cases. And that's where we built an open source solution called Hub that helps data scientists to very efficiently represent unstructured data and then stream it to deep learning frameworks, more specifically now supporting PyTorch and TensorFlow, to train those models and, like, kind of rapidly evolve all these dataset iterations.
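To give a sense of what that workflow looks like in code, here is a minimal sketch of creating a Hub dataset and streaming it to PyTorch. The API names (hub.dataset, create_tensor, hub.read, ds.pytorch) follow my reading of the Hub 2.x documentation, and the bucket path, file names, and tensor names are made up for illustration, so treat this as an approximation rather than canonical usage:

```python
import hub

# Open (or create) a dataset backed by object storage; the path is hypothetical.
ds = hub.dataset("s3://example-bucket/animal-photos")

with ds:
    # Instead of scalar-typed columns, the "columns" are n-dimensional tensors.
    ds.create_tensor("images", htype="image", sample_compression="jpeg")
    ds.create_tensor("labels", htype="class_label")

    # Append samples; image shapes can vary from sample to sample.
    for path, label in [("cat_001.jpg", 0), ("dog_001.jpg", 1)]:
        ds.images.append(hub.read(path))  # reads and compresses the file lazily
        ds.labels.append(label)

# NumPy-like indexed access into the stored tensors.
first_two = ds.images[0:2].numpy(aslist=True)

# Stream straight into a PyTorch dataloader without copying the dataset locally.
loader = ds.pytorch(batch_size=2, shuffle=True, num_workers=2)
for batch in loader:
    images, labels = batch["images"], batch["labels"]
    # ... run a training step on this batch
```

The same dataset object can also be handed to TensorFlow through a ds.tensorflow()-style call, which is the "stream to deep learning frameworks" part described above.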
[00:07:09] Unknown:
And you're mentioning that one of the bigger points of friction is the specific way that the data is stored, because in analytical use cases, for instance, a lot of the data is stored as Parquet files, or you might be dealing with these large image files that are optimized for human viewing, but not necessarily for machine processing. And I'm wondering, what are the areas of friction and the ways that the storage of the data could slow down the actual development process for machine learning workflows?
[00:07:42] Unknown:
I was looking into a recent survey, and, basically, the survey states that, like, one third of ML projects fail because you don't have a solid data foundation. And a solid data foundation both includes the storage and also includes, like, making sure the data is clean, making sure it's, like, version controlled, and so on. So the problem is very big for the machine learning space as a whole. And then you have structured data or tabular data, where everything is sort of figured out. You have Parquet. You have, like, version control, let's say, you can do with Delta Lake. You can use databases if you want, like, high guarantees on ACID transactions. And you have data warehouses where you can run queries at large scale. But then you go into, like, spaces where the data is, like, okay, I have a lot of videos, I have a lot of text data. You can sort of figure it out somehow, like, using maybe MongoDB or another data store. But, like, I have a lot of audio files. How can I query on this data? So then all this gets into the, like, fundamental question of, okay, what is the best way, or what should be the way, to actually store this data? So you can think of us, or the solution that we are building, as n-dimensional Parquet files. Instead of treating the data as, okay, everything should be tabular, now, instead of having columns in the table, you have tensors as the columns, which are just n-dimensional arrays. I was recently talking to a professor at Berkeley.
Berkeley came up with all these, like, databases, and a very interesting and, like, non-intuitive thing that he told me is, you know what, the best thing that we are very proud of coming up with over the last 40 years is that we can store the data in tabular format, and then the rest is, like, just a consequence of it. And the way we treat this is, okay, what if we take the tabular format to the next level, where we can also, like, represent images or videos, which, if you think about it, are actually more structured data than the traditional, let's say, tabular data. But there are a lot of challenges: let's say, you have to support dynamic shapes, you have to support varying sizes, as I mentioned, and figure out how to efficiently represent them in any, like, cloud storage location to stream them to a compute engine. Then, later, you can actually build a data warehouse on top, or a database on top. Treating
[00:09:50] Unknown:
the data as tensors in the storage layer obviously makes it much easier for the mathematical models of the machine learning process to be able to interact with it. And I'm wondering if you can just give a bit of a compare and contrast to what projects such as Pinecone are building with a vector database and any other movement in that space that may be coming up in recent years?
[00:10:12] Unknown:
To my best knowledge, there are actually 2, like, branches you can think of when people say, oh, I'm storing unstructured data. There are, like, vector databases, and there are also, like, graph databases. They both claim that they are doing unstructured data, which is true in some sense as well. With vector databases, I mean, obviously, you can take any tensor or column and then put it into a vector and then store it as usual. The main thing that they are very focused on, and which is a big problem as well, is, like, doing a very efficient search. So you have a lot of vectors. Like, the first use case that we worked with, we trained a model to take all these text documents, like all the patents, and then put them into vectors so they can be put into Elasticsearch and then run a very efficient search on top. With graph databases, the problem is more, okay, how can I represent this structure? It could be, let's say, in products, where you have, like, 1,000,000 different graphs, or in social networks, so that I can, like, efficiently generate insights from them. But none of them actually focuses on, like, let's say, if I take an image and put it into a vector, then when I start to access this data and train a model on top, I have to somehow reformat it again to plug this into my ML frameworks. That's the difference with what we actually do. I mean, obviously, there's some, like, whenever you put the data in bytes, it's a 1-dimensional array. But the way the models treat that, they think that this is very native. I think, like, representing it as a tensor is not anything new there.
The main insight, I think, is that, hey, deep learning models, they take input tensors and output a tensor. They don't really care if it's an image or a video or maybe structured data; tensors are what they care about. And we felt that, hey, instead of actually having all this preprocessing done before the training process, why don't we do the preprocessing before we put the data into a dataset store so that it can natively plug into, like, a deep learning framework, or any machine learning framework, where you can generate insights? And then you can obviously use the same format to run a vector search. The graph representation will be slightly more tricky to do on top of this, since it won't be that ideal, but it's still doable. And, yeah, let the data scientists decide what they want to do with the data. As I was going through the documentation
[00:12:16] Unknown:
for the Activeloop platform, I was noticing that a majority of your current focus is on image and computer vision workflows. And I'm wondering if there's anything about the specific needs of computer vision workflows, and sort of the requirements in industry, as far as that being a focus that led you in that direction versus, you know, any sort of focus on natural language processing?
[00:12:43] Unknown:
So first, everything in startups is actually from passion. So I'm really passionate about computer vision and why it could be potentially helpful for problems we solve today: self-driving cars, agriculture, like, medical imaging, and so on. Secondly, there's a big gap because, like, computer vision has just recently been developed, like, in the last 5 to 7 years, and there was not a lot of infrastructure around the tooling so that you can build on it. Another insight you can take from there is that, okay, now we are getting into this phase where we deploy, let's say, self-driving cars, we deploy a lot of robots and so on, and there's a big trend of generating a lot of unstructured data. And there's a huge challenge, I'll say in machine learning as a whole, which is, like, how we can generate insights that are valuable and generate revenue. So far, extracting value from unstructured data, like, especially images or videos, is actually a challenging business. I mean, you see all the excitement in self-driving cars, and all the, like, IPOs you see there, but not many of them have figured out the business model, like, how to generate revenue from it. And not many of the, like, computer vision startups that are just getting started have figured out, okay, how this is gonna fit into the whole economy at scale. The challenge we see is, okay, assuming, and we believe, that this market is gonna be very big, there will definitely be this need for having a very, very good way of representing the data, and there's no current solution doing that. So that's where we can find our niche. But at the same time, the same technology could actually be used for structured data. We heard some interest from folks from our community saying, hey, actually, you know what, I want to store this 500 GB of, like, kind of time series slash tabular data on Hub, and we asked them why they want to do that. The problem was that they wanted to store it in Postgres, and with Postgres, like, you have to pay, let's say, $500 per month to store all this data in memory. And they need to have, like, a much cheaper data representation, but at the same time, they want to have querying capabilities that can run very efficiently on top. And that's where, okay, yeah, you can store this data on Hub, and then it actually sits on S3, so it's, like, 10 times cheaper in terms of the infrastructure cost. You don't have to store this data in memory, and you can stream the data to the computing machines as if the data was local to the machine. So you actually get all the kind of querying capabilities, and given the use case, that went very well. Having said that, with images, you won't store the data in memory; it's very expensive.
And instead of having 500 GB of data, you can have, like, 10 or 100x more data, and then the cost gets complicated. Nobody actually stores the data, let's say images, in MongoDB, because, like, you see, images are big. And then you go to the next level, where, okay, instead of having just, like, let's say, full HD images, I have aerial images that are 50,000 by 50,000 pixels. Or, at least in the neuroscience lab that I was working at, we were taking a 1 cubic millimeter volume of a brain and cutting it into very thin slices. Each slice was, like, 100,000 by 100,000 pixels, and we had 20,000 slices. There's no format or solution that supports that, so in the lab, we decided, okay, we can build this specifically for this use case. And that's where you see the biggest benefits, actually, if you want to deal with, like, large aerial images or biomedical images and so on. Yeah. That's, like, kind of a long answer to your question, but I hope, like, I covered all the aspects that you were looking for. Yeah. It's definitely interesting the amount of potential that there is in being able to let computers actually process and understand
[00:16:11] Unknown:
visual inputs because of the very sort of sight-oriented nature of how humans operate and the number of activities that require some visual input for being able to interact with the real world environment, thinking in terms of, like, robotics or self-driving cars or, to your point of, you know, biological imagery, being able to do research in terms of biology, particularly at the micro scale. Yeah. Definitely very interesting problem areas. I'm excited to see some of the ways that this can be used for that. And so digging more into Activeloop itself, can you talk through how the platform is architected and some of the optimizations that you've built in, as far as being able to store and serve these processed data formats, to allow the machine learning workloads to
[00:17:00] Unknown:
read in and process these tensors? Let me get started by actually just going through, at a very high level, the kind of principles that we follow, and then I'll get into more, like, technological insights or the assumptions that we take to be able to do that. So the first one is that we have been strong on making the APIs very familiar to the data scientists, like having this NumPy-like or pandas-like access to the datasets. That feels very natural, and they don't have to go into the API reference for checking all the APIs. Like, can we give them a familiarity where, in an ideal world, when you're building a dev tool, you don't have to look into documentation? It's not the case, obviously, but that's, like, the dream, where, similar to B2C apps, you don't want to look into the reference or manual. That's the first, like, kind of target that we are looking at. The second principle that we follow is on performance, like, how can we make sure that we continuously benchmark every line of the code to make sure that Hub is getting improved and optimized every time?
And the third one is extensibility. Can we make it so modular that Hub can actually become the hub for machine learning infrastructure? So you can easily integrate with labeling tools, you can easily integrate with computing tools, or the cloud, and so on. So those are the 3 main, kind of, high-level principles that we follow. And let me get into the more exact data architecture side of things. As I mentioned, like, one of the keys is taking, let's say, 1,000,000 images and treating them as a single tensor of 1,000,000 by 512 by 512 by 3. It's more, how can we lay out the data? Essentially, if you want speed and performance, you have to figure out the data layout for specific use cases. And for our use cases, the top priority is training a model on the cloud, where you can have either a single GPU or a hundred GPUs accessing the data at the same time.
The way we do that goes against the Hadoop ecosystem or MapReduce ecosystem, where you have to load the data onto each of those machines, bring it to the compute, and then run the map and then the reduce. That's actually not working very well for deep learning applications, as we see with Spark. So, can we store the data in a centralized location but keep the same performance or guarantees for them? And what we found out is that if you have these 3 assumptions working properly, you can actually guarantee this. The first one is obviously the network speed from the storage to the computing machines. We are still bounded by physics, like, the laws of physics, so there's no way we can pass the speed of light. The second one, which we have more control over, is how the data is stored: like, how you take the data and then chunk it into storage, how you format each of the, like, arrays or images into chunks, and then how you represent that on S3. And the third one, which is very interesting, is this data obliviousness, which basically says that before I start the computation, or during the computation, I can know the access pattern of the data. So, basically, I will know that I want to ask for the first index, second index, and third index of the images. It could be, like, randomly shuffled; that's not the problem. The point is that I can use the networking to prefetch the data so that the machine thinks the data was local while it's running the computation.
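To illustrate the data-obliviousness idea on its own, here is a minimal, generic prefetching sketch. It is not Activeloop's implementation, just the general pattern of using a known-in-advance access order to overlap network fetches with computation; the function names are invented for the example:

```python
import concurrent.futures as futures

def stream_chunks(fetch, indices, compute, depth=8):
    """Overlap storage fetches with compute when the access order is known up front.

    fetch(i)   -- downloads chunk i from object storage (network-bound)
    indices    -- the full access order, known before the computation starts
    compute(x) -- the GPU/CPU work for one chunk
    depth      -- how many fetches to keep in flight at once
    """
    with futures.ThreadPoolExecutor(max_workers=depth) as pool:
        # Because the order is known ahead of time, fetches can be issued early.
        pending = [pool.submit(fetch, i) for i in indices[:depth]]
        for n in range(len(indices)):
            chunk = pending[n].result()        # usually already downloaded by now
            nxt = n + depth
            if nxt < len(indices):             # keep the prefetch pipeline full
                pending.append(pool.submit(fetch, indices[nxt]))
            compute(chunk)                     # the GPU works while fetches continue
```

As long as the prefetch depth covers the network latency, the expensive GPU rarely waits on the object store, which is the effect described next.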
And if you have these 3 assumptions, what you can show, and that's what we do, is that you can have the datasets stored, let's say, on AWS S3, but streamed to the GPUs as if the data were local to the machine. So, basically, for the GPU, which is the most expensive resource you have in your whole pipeline, it doesn't feel any difference between reading the data from a local SSD or actually streaming it over the network. And that's the kind of technological architecture there: you isolate storage and compute, but at the same time, they are, like, in kind of a symbiosis where they feel that they are very local to each other. And then as far as the
[00:20:48] Unknown:
data modeling considerations for people who are looking to store their data in Hub, I guess, what are some of the preprocessing workflows for being able to start with a source image dataset and then process it and store it in the Activeloop platform so that it allows for arbitrary access into these different tensors, or being able to parallelize across the tensor arrays to do sort of multipath processing?
[00:21:18] Unknown:
With the previous version of Hub, it was very tricky. But for 2.0, we spent so much effort on thinking about how we can make it very simple for data scientists to be able to transform their original data. Currently, what you can do is basically go and say, okay, I need these 3 tensors, and then let me just treat them as lists and append all the data there. And then we'll do all the heavy work behind the scenes to make sure that we take the data correctly and then structure it. You can also use a feature that we just pushed today, actually. It's called transforms, or compute, which can actually parallelize this process. So you just define a lambda function and then run it over the data. But I think what's coming in the next few months is a one-liner where you can point to a folder structure, we extract the metadata from how you store the data, and then transform it automatically for you. And what we found out, in very naive terms, is that we can translate your data into our format faster than you can just copy it on a Windows machine, or, like, copy the data from local storage to AWS S3. We have been very focused on this because we felt that it could be a friction point until this becomes mainstream. Once this is mainstream, you don't have to store your original raw-format data anywhere anyway. But since we are doing this transition, we have to make sure that it's very convenient and easy to use. Another interesting element of what you're building, particularly with the Hub project, which is the open source library for being able to interact with these workloads, is that you do provide the Activeloop managed back end, but then also people can just use arbitrary object storage, whether it's S3 or GCS, etcetera.
[00:22:51] Unknown:
I'm wondering what are the feature and performance trade-offs between those two options of self-managed storage locations versus what you're providing with the managed Activeloop
[00:23:01] Unknown:
platform? We wanted to make it open source from day 0. So we don't, like, actually force anyone to log into our platform or use our managed service. That's why we're like, hey, you can use S3, or you can use GCS or Azure, to store all your data. But at the same time, it's like, hey, actually, we can do things for you that are much more efficient or that will make your life easier. So, let's say, if you use the managed platform, you don't have to pay the egress fees, and that's thanks to our partner, Wasabi, for storing the data. So, basically, for data scientists who are not working at large-scale companies, this is a 4 times cheaper alternative to AWS, without egress, where they can store their data and then train a model, let's say, on Google Colab. It plays very nicely with the community, with the researchers, and so on. But then once you go into, like, a scaled engineering group or, like, a larger company, they have to worry about security, compliance, and, like, the encryption of the data as well, both at rest and in the streaming process. That's where our managed solution works, where the customer can point to, let's say, their own AWS cloud. That's where we get into their internal VPC, and then store the data and stream it, and make sure that the data never gets out of their hands. So those are the benefits of the managed version of Hub, but at the same time, we don't limit them. Like, if you're a company and you want to use Hub, but you don't want to work with us, that's really great. Feel free to use it, put it into your S3, and then just point to your S3 and start using it. And we have seen a lot of companies doing that.
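As a rough sketch of what those two options look like from the library side, assuming a Hub 2.x-style API (the bucket, organization, and credential values below are made up for illustration):

```python
import hub

# Self-managed: the dataset lives in your own bucket, and Hub reads/writes chunks there.
ds_self_managed = hub.dataset(
    "s3://example-company-bucket/field-imagery",
    creds={"aws_access_key_id": "...", "aws_secret_access_key": "..."},
)

# Managed: the same code, but the dataset is hosted on the Activeloop platform.
ds_managed = hub.dataset("hub://example-org/field-imagery")

# Either way, the downstream training code stays identical.
loader = ds_self_managed.pytorch(batch_size=16, num_workers=4)
```

The point of the sketch is that the choice of backing store is just a path, which matches the "point to your S3 and start using it" framing above.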
[00:24:29] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. And then another interesting element of this data storage layer that's specialized for machine learning is that there are a number of workloads in different organizations where the same source data assets might be used for building and deploying a machine learning model, but also in an analytical context, where you want to be able to gain some sort of business intelligence from those source datasets. And I'm wondering how people have been able to synchronize or manage these two disparate workloads, where they might store their data as these preprocessed tensors in Hub, but then also pull, you know, maybe some image metadata out of the source files into their data lake or data warehouse for being able to do analysis, and then be able to feed back any insights from their machine learning models into those locations as well, to be able to merge those different paths.
[00:25:58] Unknown:
One of the features or requests that we heard recently, which we don't have yet, is, like, a very tight integration with Spark, so that you can, as you mentioned, get the metadata, push it to and run it on a Spark cluster, and then get the insights there. Or, let's say, is there a way we can synchronize or interface with a data warehouse like Snowflake, to pull the data into our format, train a model, and then push new metadata back to Snowflake? So that's another integration. And what we are doing now is, like, kind of prioritizing the integration work that you can do, especially with more traditional workloads in data warehouses or, like, the current data lakes that you use. So we kind of want to be treated not as an alternative, but more as a kind of combo, where you also additionally have these capabilities of storing and very efficiently working with unstructured data. But at the same time, what we have seen and what we are looking for is, like, can we make this whole process both familiar and also, like, much more efficient, and let you use new alternatives. Let's say, another kind of integration we are doing is with Ray, or Anyscale's Ray, where you have the data in distributed storage and you also have distributed compute. Can you marry them? And that's kind of a new way of running the workloads.
And that's where our primary focus is. After nailing down the format, can we integrate with the rest of the ecosystem and make it as simple and seamless to use? One thing that I want to stress, which you mentioned, is, okay, I have all these analytics use cases. What's the best format for that? That's what Parquet, or, like, all the other data warehouses based on SQL queries, have been built on. There isn't anything similar for, first of all, machine learning in general, but also, like, deep learning specifically. Or, let's say, is there a way I can query unstructured data like images? And the way you would do it is, okay, I have, let's say, a million images. I want to find all the images that have a bicycle riding in front of the car during the night, so I can subsample it and maybe train a model separately. Is there a way I can run a SQL query on top of this and then generate the insights directly? And then, behind the scenes, oh, I have a model that extracts the bicycle and the time of day and so on, and I will run this model at scale very efficiently, get this metadata, and then push the subset of these images back to the user in a very efficient manner. And that's where we see this becoming the next step: once you nail down the format itself, can we bring the data warehouse capabilities to this structure?
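No concrete query API is described here, so purely to make the idea tangible — model-extracted metadata stored as extra tensors, then used as a filter to subsample the dataset — here is a hypothetical sketch; the tensor names, and the assumption that an index-list view supports .pytorch(), are mine, not Activeloop's:

```python
import numpy as np

# Assume a detection model has already been run over an open dataset `ds` and its
# outputs stored as additional tensors next to the images (tensor names hypothetical).
classes = ds.detected_classes.numpy(aslist=True)   # per-image list of class names
hours = ds.capture_hour.numpy().reshape(-1)        # per-image capture hour, 0-23

# "All images with a bicycle at night" becomes a boolean mask over the metadata.
mask = np.array(
    ["bicycle" in c and (h >= 21 or h <= 5) for c, h in zip(classes, hours)]
)

# Subsample without copying the underlying images, assuming index-list views.
night_bicycles = ds[np.where(mask)[0].tolist()]
loader = night_bicycles.pytorch(batch_size=8)      # train only on the subset
```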
[00:28:18] Unknown:
That also introduces an interesting question as to who are the majority users of what you're building with Activeloop and Hub, where in analytics use cases, there is a fairly well understood delineation where the data engineer is responsible for sourcing the data into the data warehouse, and then you might have an analytics engineer who's using dbt to build out the analytics pipelines to populate business intelligence systems, and there's some overlap there. With Hub, it seems that the end user is largely going to be the data scientist, but that they still need to rely on the data engineer to be able to capture and organize the source image datasets. I'm just wondering how you see the Activeloop platform and the Hub framework influencing or supporting the combined work of data engineers and data scientists.
[00:29:08] Unknown:
So if you take the spectrum of, let's say, developers: on the left hand, you have the developers, and then you have data analysts, and then you have data scientists, and then you have machine learning engineers, and then you have deep learning engineers. The deep learning engineers are on the order of 10,000. Machine learning engineers are on the order of 50,000, let's say. Data scientists are on the order of 100,000 to 1,000,000, and then data analysts go further still. And we took an approach that's like, hey, let's heavily focus on the right side of the spectrum and make this work very well for this small group of people, like, give them data engineering capabilities so they don't actually have to do that work themselves. For example, when I was starting with training models, I had to learn a lot about distributed systems. I had to learn a lot about how the data should be stored, like, how to build all these data pipelines at scale, before I could start doing the training process. That's what I experienced. So to be very efficient, you essentially have to be a PhD in distributed systems and in deep learning and machine learning, and there are only very few people like that. Even us as a company trying to build a tool, it's very difficult for us to find those people to join us to get started on this. And that's where our focus is: enabling the data scientists who work with a lot of images, not interfering with data engineers, and giving them the tooling so that they can really work on the data very speedily. Even when you're a data scientist and you're starting on a project, you as a data scientist have to spend a lot of time as a data engineer on the project. And that's what we are eliminating, not the data engineers. And then data engineers can enable their data teams to run much faster. But this is all, as you mentioned, still, like, getting figured out in organizations, where they don't have this kind of settled manner of how we're gonna work with the data, as they have with, like, analytics use cases, where the business unit pings the data analyst, the data analyst pings the data scientist, and then they get, let's say, the integrations, the SQL query, the dashboard, and back to the business unit, and so on. But this is not figured out in the unstructured world, and that's where we are kind of building the base for it. And then later, as we envision, the company evolves, we get more usage and adoption on the format side, and we say, can we actually make this tool so generic that data analysts can use it? And furthermore, like, not only data analysts: can anyone just type a query in text, and then it uses GPT, let's say, to transform it into a query that is, like, native to the tensors, run any model that you need to add any additional metadata, and then bring the results back to the person who was just looking for the disease on an aerial image? That's where our vision goes, but, like, we have to get there step by step. Another interesting aspect of what you're building at Activeloop is
[00:31:47] Unknown:
that it seems to obviate the need for a number of different pieces of sort of data infrastructure and, quote, unquote, MLOps infrastructure that have been gaining a lot of ground in sort of the very text-oriented world of machine learning, thinking in terms of things like feature stores and, you know, different stream processing engines. And I'm wondering what you see as the potential for expansion of Activeloop into areas beyond just the vision-oriented datasets, and maybe potential integration with these feature stores, or whether those technologies are necessary in a world where all of your data is living as a preprocessed tensor for these machine learning workloads?
[00:32:33] Unknown:
Not really. Like, let's say, is there a Kafka for unstructured data? Or, like, can we take Tecton and apply it to images as well? So the problem with Kafka, let's say, is that, hey, if you have, like, 1,000,000 mobile phones and all of them are streaming videos, can you actually use Kafka or, somehow, like, any stream processing to store this data? There are some core fundamental architecture changes you have to do there to be able to enable video processing, let's say, because you have to minimize the copies of the data, basically. Like, eliminate all of them and get to the storage just once. But then there could be a potential symbiosis where you can take what we do as a tool, combine it with what Kafka has, and then, hey, now you can actually stream, like, a million videos to a data lake or a data store, and then stream them to do a training process on top. Or, like, there's a feature store, but then you can actually version control all the deltas of your data. Having said that, the way we go forward is that if we can do an integration that enables the data scientist or the end user much more efficiently, rather than us building this whole functionality, which is very complicated itself, then it's better to do it through the integration. But when we see that those technologies need to make a core fundamental change, they also will be very reluctant to change their core to support additional types of data unless the need is very big, and that's at least what we observed with Databricks as well. So having said all this, we really want to work with other projects, both open source and non open source, to expand the capabilities of how unstructured data specifically can be stored and then streamed and processed, like, from, let's say, a million cameras all over the world.
And at the same time, if, like, hey, this is gonna require changing another core functionality or core architecture, maybe we can have an alternative to that and then build it quickly. You mentioned earlier that you currently integrate with PyTorch and TensorFlow
[00:34:21] Unknown:
as far as the Hub library and the platform integrations. I'm wondering what would be involved in expanding and generalizing the support for other machine learning frameworks and other language runtimes to be able to take advantage of the
[00:34:38] Unknown:
storage and interface optimizations that you've built at Activeloop? So with other deep learning frameworks, it's more or less, sort of, straightforward. Like, let's say, we have interest in MXNet from a few folks in the community. Then, can we actually take this and put it into the more, like, traditional world, where we can integrate this with XGBoost or LightGBM or other ML training frameworks? And that should be very straightforward as well. Though, like, what we would like to specify is that most of the benefits actually come because the data is unstructured. That's where, like, the streaming and optimizations come into place, the assumptions that I mentioned before. But at the same time, yes, that's, like, totally possible, and given the interest, we can start, like, putting more engineering resources into optimizing those use cases. So if one day comes and, let's say, I don't know, XGBoost is the most useful use case for Hub, then we'll put in all our effort to make sure that this use case is supported as optimized and as efficiently as possible for those users. If I nerd out here, like, from the research I remember at Princeton, there are, like, memory-limited algorithms that take into consideration that, okay, I have, like, a very limited amount of memory where I can store the data locally, but I also have huge data, so I can't just run, let's say, training on a single machine. That's what we want to envision for models: they see the whole data at the same time, but they are abstracted away, basically, from the compute limitations. Yeah. As you have been
[00:36:03] Unknown:
building the Activeloop platform and building this open source ecosystem around it, what are some of the most interesting or innovative or unexpected ways that you've seen Activeloop and the Hub framework used? The one use case I mentioned before is that, hey, actually, we can store this data on Hub instead of storing it in a Postgres
[00:36:20] Unknown:
DB, which was, like, kind of surprising. We were not optimizing for that. We were optimizing for something else. That was slightly interesting, but I wouldn't call it unexpected. In terms of the challenging lessons that we learned, we learned a lot of stuff, especially with our first iterations, where we didn't pay much attention to the user experience, and the users had to specify a lot of parameters to make sure that the storage was always, like, performing as we intended. Like, for example, one benchmark we ran is, hey, what if one of the customers asks us, the data could be stored in, like, MongoDB if MongoDB starts supporting our images; what are you gonna do? And then we take MongoDB, we push a very small dataset, a couple-GB dataset, to MongoDB, and then we take the same dataset, push it into Hub on S3, and then train a model on top. And what we showed there is that we can actually get the same performance as MongoDB, but without any database or any in-memory cluster machine that serves the data while you're training the model. For Hub 1.0, it took us 20 minutes to spin up MongoDB, and it took us, like, 8 hours to fine-tune Hub to make sure the performance was what we, as the developers using it, knew it could be. Now, with 2.0, we took out and removed all these, like, complications and made it very simple to use. And because of this, we actually made a lot of core assumptions in the data layout that you will then see in other, like, similar products or, like, open source tools.
And those assumptions in data engineering, like, they have consequences. And those consequences, like, then alter all the architecture you're gonna build later, and so on. So you have this butterfly effect where making a core architecture decision on, like, how a single bit should be stored, that, like, impacts the rest of the follow-up, like, getting into, like, the top of the pyramid of all the data. So I think it's very interesting to get into these limitations of Kafka and databases, and how you guarantee all the things that you want data scientists to have, but then you're gonna have a lot of trade-offs to make. That's from the technological perspective. As the professor at Berkeley, like, mentioned, essentially, they're just getting started, both at Stanford and Berkeley, to research this topic. So it's like, you've nailed down the databases for structured data, like data warehouses and so on, but for unstructured data, this is just getting figured out in research. Nobody has yet started publishing papers.
It's kind of a challenging problem for a startup to solve, because usually what you see is that either you already have a well-established paper or published project and then take it to market, but you don't even know that during the startup, like, you do it at much later stages; or you, like, build a, like, mobile application where you just see something. That's the kind of, I'll say, challenging lesson we learned. Yeah. We are trying to figure out how best we can combine both the innovation
[00:39:07] Unknown:
and also, like, the kind of business requirements for building a sustainable business on top. So, really, a very interesting problem space. And that was one of the things I was noticing as I was getting ready for this interview, that you went straight from your PhD program into this business, and I imagine that that brought in a lot of interesting complexities and challenges and learning curves beyond just the technological problems that you're trying to address.
[00:39:30] Unknown:
Yes. In that case, Y Combinator was very helpful, like, kind of re-brainwashing us from PhDs into founders. I myself still have this problem, like, still thinking too much, too complicated. And there are certain things you can do so snappily, like, without getting very far into, like, building everything end-to-end perfectly. It could be a product, it could be, like, a plan or strategy or the sales motion, and so on. Just, like, how can you simplify all this? Like, how can you simplify it such that anyone can understand what we are doing? And we have had this problem from day 0. And we are still, like, in this process of, okay, how can we simplify and communicate the benefits or the value propositions of what we do to people who are not actually doing data science, or, like, general people who are not in the domain.
And I feel that that's one of the things that's challenging when you're coming in with all this technical background, but that becomes a limitation
[00:40:22] Unknown:
for you to talk to more people. You may have already touched on this, but what are the most interesting or unexpected or challenging lessons that you've learned while working on Activeloop and building out the technology and the business? We have been very focused on the technology,
[00:40:36] Unknown:
and technology is not the main deciding factor. I mean, it's very important, obviously, for such a company, but there are also other aspects you have to nail down, including the product and sales. The number one lesson that we learned, I think it was at Y Combinator, is that in B2B, better sales beat better technology. You still want both of them to be good, but you also have to focus on talking to the customers, learning their real needs, instead of, like, saying, oh, this is a challenging problem, let's try to figure out the solution for it. And I think that's the biggest lesson for technologists: to listen to the customers, to learn from them, and then iterate on the feedback rather than on internal insights. Though sometimes you have these exceptions, which are great, where things actually work out with your insights, but, like, the probability is much lower than when taking the input from the environment.
[00:41:23] Unknown:
And for people who are interested in being able to optimize their machine learning processes, or who are dealing with computer vision workloads, what are some of the cases where Hub and the Activeloop platform are the wrong choice?
[00:41:37] Unknown:
Essentially, initially, the problem was that if you have, like, very small data, like, you have MNIST-sized data, which is 28 by 28 pixel images, then you can actually store it in a usual database in, like, a more structured form. And we haven't yet super-optimized for those use cases, so that's kind of a flag to be upfront about. Our main benefits come into place when your datasets become bigger. Like, you start seeing the benefits when you have more than, let's say, a couple hundred GB of data. Like, the real threshold, I would say, is, hey, I can no longer store this data on my local machine or a virtual machine on the cloud; I have to copy this data every time I start training. That's where, hey, you can actually use Hub to not waste any GPU cycles or compute cycles or your time copying this data there. So that's where our benefits come into place, and they shine when you have petabyte-scale data. Currently, 2.0 is not yet scalable to the petabyte level, and we are working towards that by the end of the year. But that's where we see ourselves, narrowing it down for larger
[00:42:39] Unknown:
use cases. As you continue to build out the technology and the business, what are some of the things that you have planned for the near to medium term and any projects that you're especially excited for? I think the main core focus for us now is to really get
[00:42:53] Unknown:
Hub adopted in organizations who are working with unstructured datasets, especially images or videos, with audio and text coming soon as well. One thing that we think this is gonna take is actually letting users and data scientists navigate through the data very easily. And this navigation comes in two forms. First of all, being able to ingest the data as a human being, so that you are able to visualize the data. It could be really huge datasets, it could be small datasets. Think of, like, Google Maps, but for actually any n-dimensional data. Or, like, I don't know if you've seen the movie Interstellar, like, the last part where he gets into a black hole and you have all this 4-dimensional data. Can you actually have a very easy tool, which we are working on, to visualize all these, like, complicated datasets and combinations of them? The second capability that we have been working towards, and thinking that this could be very useful, though it hasn't been proven yet, is how can I query this data, how can I process this data?
Like, can I quickly search, like, the way you do a Google search on the whole Internet? Can I actually search over all my data and also, like, instantly visualize it? In Iron Man, you have all these capabilities to quickly locate the target you're looking for, and so on. I mean, it sounds futuristic, of course, but if you nail down the use cases you have with the companies and what they're looking for, that's where we see the most impressive things coming in. Our bets are on, okay, can we make, like, Snowflake for unstructured data? But then you also need Tableau for unstructured data; you have all these other tools. Either there will be other, separate tools that will develop the software and we can do an integration, or can we make this end-to-end user experience
[00:44:31] Unknown:
very seamless and easy to use such that anyone only data centers can use it? Are there any other aspects of the work that you're doing at Activeloop and the overall space of
[00:44:40] Unknown:
dealing with unstructured data and optimizing for machine learning that we didn't discuss yet that you'd like to cover before we close out the show? I'd also love to mention that we have a podcast, Humans in the Loop, where we invite the CEOs of different companies that are AI-core. So, basically, they build AI applications, and we talk to them about the challenges they have, both from the business perspective and from the technological perspective, about the data. We also have a Slack community, which you can join at slack.activeloop.ai, or ping me on Twitter at @DBuniatyan, and follow, like, just our journey on Twitter at @activeloopai.
I feel like what we are looking for is more people joining the community, giving us feedback and their thoughts on how we can take the direction, and, more importantly, what are the main problems they're seeing that we can help them improve, and maybe also welcoming contributions from the open source side of things. So that's, like, kind of a high-level pitch here to this community.
[00:45:35] Unknown:
Yeah. I'll add links to all that in the show notes. And for anybody who wants to get in touch and follow along with the work that you're doing, I'll have you add your preferred contact information there. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:53] Unknown:
So every person listening to this podcast probably knows that dataset management is not that easy. In fact, it's, like, a big pain to process it and, like, to solve these challenges and deal with data pipelines and all these problems. But inside, we all feel like, oh, I want to go train this model, I have to fine-tune the architecture for production. But from the other perspective, to get better predictions, you need better models, and, like, there's not that much you can do after, like, you've gone through all these kind of automated stages for the model itself. And we feel that the community and the researchers have shown very great results on the models themselves, but because of this, we kind of lost the data-centric approach that we had, let's say, 30 or 20 or 10 years ago.
And we have been too focused on the models, and less focused on the data. And not just the data itself, but, like, treating datasets as living organisms. As you have, like, software 2.0, where the software or the code generates other code to be able to solve the task, we feel that the same should also be done for datasets, where datasets are organisms in that they grow, they expand; like, the models learn from the datasets and then generate more data out of them, and then you can have data 2.0 with software 2.0, in quotes, essentially without the cost of the data scientists' brainpower solving these problems.
[00:47:20] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Activeloop. It's definitely a very interesting project and an interesting problem domain, and I'm definitely excited to see more projects and companies building out more optimized formats for simplifying the lives of data scientists and deep learning engineers. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thank you very much, Tobias, for the invite. Really great chatting with you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Activeloop with Davit Buniatyan
Challenges in Data Management and Machine Learning
Data Storage and Optimization for Machine Learning
Focus on Computer Vision Workflows
ActiveLoop Platform Architecture
Managed vs. Self-Managed Storage Solutions
Synchronizing Analytical and Machine Learning Workloads
Supporting Data Engineers and Data Scientists
Potential for Expansion Beyond Vision Datasets
Innovative Uses and Lessons Learned
Future Plans and Exciting Projects
Community Engagement and Final Thoughts