Summary
Building a machine learning model can be difficult, but that is only half of the battle. A perfect model is only useful if you can get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains how deploying and maintaining machine learning projects in production differs from regular software projects and the challenges that this brings. He also describes the Hydrosphere platform and how its different components work together to manage the full machine learning lifecycle of model deployment and retraining. This was a useful conversation for getting a better understanding of the unique difficulties that exist for machine learning projects.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Stepan Pushkarev about Hydrosphere, the first open source platform for Data Science and Machine Learning Management automation
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Hydrosphere is and share its origin story?
- In your experience, what are the most challenging or complicated aspects of managing machine learning models in a production context?
- How does it differ from deployment and maintenance of a regular software application?
- Can you describe how Hydrosphere is architected and how the different components of the stack fit together?
- For someone who is using Hydrosphere in their production workflow, what would that look like?
- What is the difference in interaction with Hydrosphere for different roles within a data team?
- What are some of the types of metrics that you monitor to determine when and how to retrain deployed models?
- Which metrics do you track for testing and verifying the health of the data?
- What are the factors that contribute to model degradation in production and how do you incorporate contextual feedback into the training cycle to counteract them?
- How has the landscape and sophistication for real world usability of machine learning changed since you first began working on Hydrosphere?
- How has that influenced the design and direction of Hydrosphere, both as a project and a business?
- How has the design of Hydrosphere evolved since you first began working on it?
- What assumptions did you have when you began working on Hydrosphere and how have they been challenged or modified through growing the platform?
- What have been some of the most challenging or complex aspects of building and maintaining Hydrosphere?
- What do you have in store for the future of Hydrosphere?
Contact Info
- spushkarev on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Hydrosphere
- Data Engineering Podcast at ODSC
- KDnuggets
- The Open Data Science Conference
- Scala
- InfluxDB
- RocksDB
- Docker
- Kubernetes
- Akka
- Python Pickle
- Protocol Buffers
- Kubeflow
- MLFlow
- TensorFlow Extended
- Kubeflow Pipelines
- Argo
- Airflow
- Envoy
- Istio
- DVC
- Generative Adversarial Networks
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, speedy SSDs, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and one opening in Mumbai at the end of the year. And for your machine learning workloads, they just announced dedicated CPU instances where you get to take advantage of their blazing fast compute units.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool, it's easy to make sure that everyone in the business is on the same page, and Data Engineering Podcast listeners get two months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial.
Support the show and get your data projects in order. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit in Chicago. The agendas have been announced, and super early bird registration is available until July 26th for up to $300 off.
Or you can get the early bird pricing until August 30th for $200 off your ticket. Use the code BNLLC to get an additional 10% off any pass when you register, and go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register for this and other events. You can go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Stepan Pushkarev about Hydrosphere, the first open source platform for data science and machine learning management automation. So Stepan, could you start by introducing yourself? Hey, Tobias.
[00:03:01] Unknown:
Thanks for the intro. My name is Stepan Pushkarev. I'm the CEO of Hydrosphere, a machine learning management platform. My personal background is in data engineering and backend engineering, and I've spent the last couple of years working closely with machine learning engineers, delivering their work to production. And do you remember how you first got involved in the area of data management? I don't remember exactly. It was probably around the earlier versions of Spark, back in 2016, or maybe even 2014. The whole software engineering world, and especially designing and building distributed systems, evolved, and the major things we had been working on were databases and data integrations. It was a very smooth transition from classical software development to so-called big data software development. It's not a buzzword anymore, but in the early Hadoop and Spark days it was very cool to deal with. And so
[00:04:19] Unknown:
for anybody who hasn't listened to it yet, you and I talked a little bit about your work at Hydrosphere, I think two years ago now, at the Open Data Science Conference, and I'll add a link to our conversation there in the show notes. But I'm wondering if you can start by giving an overview of what Hydrosphere is, some of the origin story of how it got started, and your motivation for getting involved with it. Yeah, sure. So back in
[00:04:45] Unknown:
2016, I guess, I wrote an interesting blog post on KDnuggets. The topic was big data science expectations versus reality, and it was kind of a manifesto of what we had been working on. It was some very high-level notes saying: hey, community, you're talking about all the benefits of machine learning and analytics, but what is the reality? There were a few takeaways from that blog post. The first was tools evolution: the existing tools in machine learning are great, but not stable or user friendly, and we will certainly see a rise of new tools that will augment the machine learning engineer and the data engineer in the future.
The second takeaway was education and cross skills. When data scientists write code, they need to think not just about abstractions, but also consider the practical issues of what is possible and what is reasonable. And the third takeaway was improving the process: DevOps might be a solution for the machine learning lifecycle and workflow. It was a very windy definition, and I was trying to play with it, maybe coin a name and define the value proposition, and get some feedback from the community.
It may seem very obvious to people nowadays, because the community has grown and the major players have evangelized a lot of good stuff. But three years ago it was kind of new, and we were looking for a community, for traction, and for feedback. The reason I was trying to share those notes and thoughts is that we had been working on some cool projects on a consulting basis, and our parent company, which we are also part of, is a machine learning consultancy and solutions provider.
We had to deliver those projects to production for our clients and make them work 24/7. Since we were contractually obligated to deliver to production, we had to invent all of that tooling for ourselves and automate our own process to decrease our overhead and automate our routine work. So we started building some DevOps-like tools: not full continuous integration and continuous delivery, but smaller pieces that would help us move faster, close the gap between training and retraining, and make things more user friendly for our internal users, the data scientists and machine learning engineers. From that very broad idea of, hey, let's do something cool and automate our own process, we thought it might make sense to find a niche, open source this project, start a new website, start a new GitHub repo, and start talking to people and getting some traction. So that was the story. We evolved.
After that ODSC in Boston, where we first met, we got some good traction, and it was kind of a test of the waters. It was our very first public conference to participate in, and I got some good feedback. Since then, that space has even been categorized by Gartner, which means it's already here. Enterprises, and big and small companies alike, should take a look and consider model management in their day-to-day operations.
[00:09:36] Unknown:
And in your experience of building and maintaining Hydrosphere and consulting for people who are getting involved in machine learning, I'm wondering what you have found to be some of the most challenging or complicated aspects of managing those model deployments and managing machine learning in a production environment. And can you provide a bit of compare and contrast with the more traditional software lifecycle of deploying and managing production environments for software applications?
[00:10:10] Unknown:
Yeah, sure. I would rate the first and probably most challenging part as dealing with exceptions, edge cases, and certain blank areas of the model. Most data science teams are not there yet, but it's something I need to mention. Another one, in the software engineering world as well as the machine learning world, is dealing with state, with the model state. For instance, if your model is stateful, it's obviously really challenging to scale it horizontally, and so on. And since you do not control that code, you cannot make it scalable enough. Dealing with API integration is also inherited from the software engineering world, but it's slightly different in machine learning.
Even small things like an array-based API versus named parameters make integrations really hard. When a data scientist exposes an API with a couple of hundred features in just an array, and you have to remember the index position of a particular field to pass parameters in, and you get the same kind of array as an output, that is useless for a software engineer who expects a nice JSON API with named features and named parameters. It's kind of a cultural thing, but it usually creates a lot of chaos in an integration pipeline. But as I mentioned, probably the first and major challenge at this moment is iterating on the model and dealing with the new concepts and new exceptions that you face in production.
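As a purely hypothetical illustration of the integration friction described here (the feature names, values, and adapter below are invented and are not Hydrosphere's actual API), the difference between a positional array payload and a named JSON payload might look like this:

```python
import json

# What a data scientist often exposes: a bare positional array.
# The caller has to remember that index 0 is "age", index 1 is "income", and so on.
positional_payload = [42, 55000.0, 1, 0, 3]

# What an application engineer expects: named features in a JSON object.
named_payload = {
    "age": 42,
    "income": 55000.0,
    "has_account": 1,
    "num_defaults": 0,
    "tenure_years": 3,
}

FIELD_ORDER = ["age", "income", "has_account", "num_defaults", "tenure_years"]

def to_positional(named: dict, field_order: list) -> list:
    """Adapter that converts a named payload into the positional form the model expects."""
    return [named[name] for name in field_order]

assert to_positional(named_payload, FIELD_ORDER) == positional_payload
print(json.dumps(named_payload, indent=2))
```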
I would like to mention an example. Last week we were working on a nice demo for a manufacturing conference, with hard hats. On a consulting basis, we provide safety control based on computer vision for manufacturing sites and construction sites: if you wear a hard hat and gloves and so on, we let you in, or we deny access. While working on this, there are of course a lot of open source and public datasets to train with, and ready off-the-shelf machine learning models, so within a couple of days you can quickly hack together a prototype.
You can hack together a pretty decent machine learning model that can recognize a person in a hard hat and a person without a hard hat. It can even authenticate that person, and so on. But obviously, when we started testing this model, it was really easy to fool it. For instance, if you just take off your hard hat and place it on your shoulder, the model recognizes it as safe access. And there are many other tricks you can use to fool the model. So when we started to iterate, we kept adding these edge cases, and even when you test it with a group of five people, it's really challenging to track all of those edge cases, aggregate them into a new retraining batch, and basically automate that process so you can iterate quickly.
Obviously, deploying on Hydrosphere, our own tool, made our process much easier. Even with this small group, we were able to make the model reliable and stable relatively quickly. And obviously, when we deploy the model to production, there will be many more edge cases that we will automatically gather and incorporate into our training, retraining, and testing pipeline so we won't miss them in the future.
[00:14:59] Unknown:
And so for anybody who's not familiar with Hydrosphere in particular, can you outline the overall design and architecture of the platform and talk through the different components and how they fit together?
[00:15:12] Unknown:
Sure. By the way, we're open source, so you can check it out on GitHub and dig in a little. There is nice user documentation, and you can dig into the source code as well. We are primarily Scala and JVM guys, so it's Scala and Akka microservices, plus a couple of databases for different purposes: Postgres for relational data, InfluxDB for metrics, and RocksDB for storing the raw inputs and outputs of each model. If you're not familiar with it, RocksDB is what Kafka Streams uses on the back end.
We obviously have Docker and Kubernetes for deployment and orchestration, plus a couple of Python microservices, a CLI, and Python runtimes for machine learning models. We use Akka Streams extensively. We switched from Kafka Streams to Akka Streams because we need a dynamic setup where models can be added at runtime, and Kafka topics are not designed to deal with that type of environment and use case. In general, those are the main keywords and main technologies we use. And as far as the different pieces of Hydrosphere, I know that you've got the serving layer
[00:17:08] Unknown:
and then you've got the metrics layer for determining the overall health of the model in production. So I'm curious if you can talk through what the different concerns are for those pieces, how they tie together into the overall workflow for somebody actually building, deploying, and managing their model in production, and how it simplifies the overall process versus what a homegrown solution might look like? Yeah. Everything starts with model cataloging and model deployment.
[00:17:43] Unknown:
So once the model has been built, we basically hook into the training output, the binaries produced by the training pipeline. It might be a pickle file, it might be a protobuf file for TensorFlow, or other formats like Keras, PyTorch, and others. We hook into that pipeline and upload the binaries to our service. We extract all the metadata out of those binary formats and generate additional metadata by hooking into the training pipeline. We capture the training hyperparameters.
We capture the characteristics of the training data, its distribution and other statistics, and a whole bunch of other metadata that might be useful for downstream applications to trace a request and the root cause of any prediction. So we catalog the models, build the Docker containers, and store the Docker images in a Docker registry. For us, machine learning models are immutable Docker images that cannot be modified, and this is how we do versioning. Our versions are not just an index number in a database; each version is a physically packaged machine learning model with all of its dependencies and all of its metadata.
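To make that concrete, here is a minimal, hypothetical sketch of the kind of immutable model-version record being described; the field names and values are invented for illustration and are not Hydrosphere's actual schema:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)  # frozen mirrors the "immutable version" idea
class ModelVersion:
    name: str
    version: int
    docker_image: str                 # e.g. a tag in a private Docker registry
    runtime: str                      # e.g. "tensorflow-1.13-cpu"
    training_hyperparameters: Dict[str, float]
    training_data_stats: Dict[str, Dict[str, float]]  # per-feature distribution summaries
    dependencies: List[str]

fraud_v3 = ModelVersion(
    name="fraud-detector",
    version=3,
    docker_image="registry.example.com/fraud-detector:3",
    runtime="tensorflow-1.13-cpu",
    training_hyperparameters={"learning_rate": 0.001, "epochs": 20},
    training_data_stats={"amount": {"mean": 87.4, "std": 112.9}},
    dependencies=["tensorflow==1.13.1", "numpy==1.16.2"],
)
print(fraud_v3)
```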
The next step is basically serving. By serving, we mean the deployment of the Docker container in a runtime environment. You can deploy the model in different types of runtimes, for instance different versions of TensorFlow or of other libraries, or just tweak some runtime parameters to improve latency, and so on. That's a typical microservices architecture: we launch and orchestrate these Docker containers using the Kubernetes API or the Amazon ECS API.
It's a well integrated and well orchestrated platform, but we do not reinvent the wheel here; it's a classical, well implemented microservices architecture. The second major part of our platform is what we call Sonar. Sonar is also a couple of microservices that shadow or mirror the traffic from the main prediction pipeline and do all the magic of analyzing inputs and outputs, analyzing model health, and monitoring the models.
We monitor almost everything that is related to the machine learning model, and it happens automatically. We generate a whole bunch of statistics and metrics, and we also have user-defined metrics: you can monitor a particular output, for instance the confidence of the machine learning model, and assign a threshold so that you are alerted when the confidence of your predictions drops below it. And the Sonar part is all based on streaming.
We calculate all the distributions, histograms, and data profiles on the fly. This may not be explicitly required by the use case, in the sense that the model may not degrade so quickly that we need real time here, but this design decision was made because streaming is much more convenient from the operational standpoint. You fail fast: if you cannot process data, you fail faster, you can easily recover, or you can easily replay what you have not processed.
It is the concept of fast data versus big data: if you do not process data when it arrives, you will likely not process it later. And there is a lot of IP in that Sonar part. There are some sophisticated machine learning models, like generative adversarial networks and some variations of autoencoders, that are designed to profile and monitor your production traffic for anomalies and edge cases, and basically to provide insights into your further iterations of the machine learning lifecycle, so we can subsample data and retrain machine learning models easily.
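As a rough sketch of the user-defined metric idea mentioned above (alerting when prediction confidence drops below a threshold over a stream of shadowed requests), one might write something like the following; the class, threshold value, and traffic are hypothetical and do not reflect Hydrosphere's API:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Prediction:
    request_id: str
    label: str
    confidence: float

CONFIDENCE_THRESHOLD = 0.6  # assumed value, chosen only for illustration

def monitor_confidence(predictions: Iterable[Prediction],
                       threshold: float = CONFIDENCE_THRESHOLD) -> Iterator[str]:
    """Yield an alert for every prediction whose confidence drops below the threshold.

    In a real deployment this would run against a mirrored (shadowed) copy of
    production traffic, so the main prediction path is never slowed down.
    """
    for p in predictions:
        if p.confidence < threshold:
            yield f"ALERT request={p.request_id} label={p.label} confidence={p.confidence:.2f}"

shadow_traffic = [
    Prediction("r-101", "hard_hat", 0.97),
    Prediction("r-102", "hard_hat", 0.41),  # likely an edge case worth retraining on
]
for alert in monitor_confidence(shadow_traffic):
    print(alert)
```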
[00:23:53] Unknown:
And I like what you're saying about fast data versus big data when it comes to managing the metrics for the models, because I imagine that, depending on the problem domain you're working in, some of those alerts might be very time sensitive in terms of knowing when you need to adjust the parameters for the model, or maybe roll it back, or rerun the training process to correct for some of that model drift. So I'm wondering if you can dig a bit more into how you collect some of those metrics, and the types of information that you're looking at for signals to let you know when you need to take some sort of corrective action for the models that you're managing, that are currently being served, and that are providing information back to end users.
[00:24:46] Unknown:
Yep. Of course, there are different use cases. In some of them you may not even have data drift at all. For instance, with classical collaborative filtering for recommendation systems, the training happens in batch mode right before the prediction job, so that's not the case. But there are cases where there is a drift, and it is a slow drift: your user behavior changes over time, and you can recalculate some metrics overnight.
That's also okay. But if you take the user experience into consideration, when you deploy a model you need to watch the model health for the next 10 or 15 minutes just as a sanity check: okay, it's working, and it's working well. So we basically need real-time metrics, and as I mentioned, we analyze them in Akka Streams. We trace, of course, the whole route of the prediction request through the models; there might be more than one model in the pipeline.
We tag each request with the necessary metadata to be stored, and we store all the requests and all the predictions for auditability and discoverability purposes. For instance, if you have a high-level metric saying that some data distribution has changed, as a machine learning engineer you will need to dig deeper, down to a particular request, or sample of requests, that caused the data drift. In a somewhat toy example demo that we run: if you trained your machine learning model on the classical MNIST dataset and all of a sudden you start sending letters instead of digits, you need to get an alert, click on that alert, drill down into the details, see that something is going wrong, and see the particular request that caused the alert. Or the alert might be caused not by one particular request but by a thousand requests, and you need a relevant user experience for that, so it needs to be done in a near-real-time way. And once you have a particular prediction, you need the ability to trace it back to the machine learning model version, to the prediction and training pipeline, and to the original dataset the model was trained with, for auditability and for the ability to make a decision about retraining, reconfiguration, or maybe hyperparameter tuning.
So basically, each request carries full information about the dataset the model was trained with, the hyperparameters it was trained with, the deployment configuration, and all the metadata that might be relevant for that particular prediction. That's the main value proposition for the end user that we're proud of.
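As an illustration of the kind of per-request lineage being described, a traced prediction record might look roughly like this; every key and value below is invented for the example and is not Hydrosphere's actual storage format:

```python
# Illustrative shape of the metadata attached to a single traced prediction.
traced_prediction = {
    "request_id": "7f3a9c",
    "received_at": "2019-07-01T14:32:05Z",
    "model": {"name": "mnist-classifier", "version": 12},
    "deployment": {"runtime": "tensorflow-1.13-cpu", "replicas": 2},
    "training": {
        "dataset": "s3://datasets/mnist/2019-06-20/",  # hypothetical path
        "hyperparameters": {"learning_rate": 0.001, "epochs": 10},
    },
    "input": {"image_shape": [28, 28, 1]},
    "output": {"label": "7", "confidence": 0.93},
}

def trace_back(prediction: dict) -> str:
    """Answer the audit question: which model and which training data produced this?"""
    m, t = prediction["model"], prediction["training"]
    return (f"Prediction {prediction['request_id']} came from "
            f"{m['name']} v{m['version']}, trained on {t['dataset']} "
            f"with hyperparameters {t['hyperparameters']}")

print(trace_back(traced_prediction))
```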
[00:28:58] Unknown:
And you highlighted a couple of things there that can contribute to overall model drift or model degradation, in terms of changes in the usage patterns of end users or changes to some of the input data. But I'm wondering if you can talk through, in general, some of the factors that contribute to that model drift, the contextual information that your monitoring and alerting can feed back to the data scientists, data engineers, or machine learning engineers to understand what alterations to make in the training process to correct for it, and the overall contextual knowledge that's necessary to engineer in resistance, or at least tighten the feedback loop, for keeping those models
[00:29:54] Unknown:
in proper working order and ensuring that they're doing what they're intended to do? I think the first and most frequent case is just not enough training data, or a retraining pipeline that is not well architected. Usually data scientists are provided with a dataset that they play with, and that's it. They iterate in their laboratory environment, and then there is a training/serving data skew. From what I see, on the one hand it is one of the simplest reasons, and it's not degradation.
It's just an organizational and architectural reason for having a machine learning model deployed that was trained on a slightly different dataset than you expect in production. And it's partially an interesting case, because it obviously demonstrates that machine learning models and the production machine learning environment need to be designed to be tolerant of this type of issue from the very beginning. Another typical reason for inconsistency between training and serving data is just the enterprise-wide environment.
When you deploy a particular model and you have more than one consumer of that model, one or more departments or teams using it, some of the teams may misinterpret your API and start using the model for a different use case. That's also a real case, and it has nothing to do with concept drift; it's more of an organizational and use-case drift. As the owner of the service, you need to be notified that your end user is trying to classify something else, or is starting to extract information from very different text, from a very different domain than the model was built for, for instance.
Other cases are just rapidly changing environments: in IoT, the deployment of new types of sensors or new locations. And actually, most computer vision and text applications require this kind of iterative approach to discovering new and new concepts. It's not drift, it's expected behavior: you're discovering, and it's just a part of your system. That's probably it from my side.
[00:33:34] Unknown:
And another thing that I was interested in when you were discussing the metrics collection is what you were saying about it being useful for revisiting the overall path and trace of the data flow through the model, to understand what the inputs and outputs were, so you can retroactively evaluate the decision making process, for cases such as GDPR where you need to be able to say why a particular decision was reached in a machine learning context. And so I'm wondering how the Sonar component fits into that type of regulatory environment, and what other interesting insights can be surfaced by going through that information.
[00:34:28] Unknown:
Yep. This is a must-have feature for enterprises: you need to store all the predictions, and you need to be able to trace them back to their origins, to the exact dataset that was used to train the model. I'm not sure about GDPR; I'm not super familiar with that particular requirement from GDPR. But it seems like a must-have requirement for any enterprise, and not even just enterprises: it's a requirement for any type of organization that is trying to deploy machine learning models to production.
One thing is just to store requests and responses and save them somewhere on S3. Another thing is to make that data really useful for business users and machine learning engineers, either for discovery purposes or for audit purposes. Basically, you need to store it in a queryable format so that it can be visualized and you can gather statistics out of it. And if you think about this further, it becomes your ground-truth feature store, with all the discoverability features and all that nice stuff, which will eventually become your main data repository within the enterprise.
It kind of flips the paradigm where the training pipeline is separated from the rest of production, where it's just a query to your database, and the data scientists and machine learning engineers sit somewhere on the side with the data, train the models, and then push to production. When you start blurring the line between that kind of offline environment for machine learning experimentation and the online environment, you see that an always up-to-date, always fresh feature store, built right from production traffic and well discoverable and well indexed, is the main asset you have in the enterprise.
It's an interesting observation that I see more and more in organizations, so we'll see where it goes. And
[00:37:36] Unknown:
as far as the level of sophistication of different organizations and the available tooling for managing machine learning projects in production, I'm wondering if you can give some background about where things were when you first started work on Hydrosphere, how the overall ecosystem has evolved since then, and how that has impacted your approach to the design and development of Hydrosphere itself. Yeah.
[00:38:08] Unknown:
Originally, as I mentioned, I had been writing blog posts, trying to coin names and explain to different types of people that this is probably not the next big thing, but it's definitely a niche. Last year, we saw Google publish a nice paper about the missing gap in the machine learning ecosystem, and other great tools like Kubeflow, TensorFlow Extended, MLflow, and many others emerged. That really helps to explain to people what it does and why, especially the why aspect of the things we're building, so it definitely accelerates the education and the community growth. I mentioned Kubeflow and MLflow.
Those are two tools with similar names but slightly different purposes, and in my opinion they really add value to the ecosystem. MLflow is, for me, the more mature project, driven by Databricks and now Microsoft as well, and it has more value-add for the end user. For those who are not familiar, MLflow is a tool for experiment tracking; that is one major part of the tool. You can track all of your runs, all the experiments, all the metrics, all the outputs and inputs, your hyperparameters, and other things in a collaborative way, so you can share it with your manager and with your team. Also, very interestingly, you can use it for the auditability of your training pipelines: you can prove at any time what data your models were trained with, and so on. It's Python based, easy to start with, easy to play with, and it has a nice and very precise use case.
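For readers who haven't used it, a minimal MLflow experiment-tracking run along the lines described might look like the following sketch; the experiment name, parameters, and metric values are placeholders:

```python
# Requires `pip install mlflow`; by default runs are logged to a local ./mlruns store.
import mlflow

mlflow.set_experiment("hard-hat-detector")

with mlflow.start_run(run_name="baseline"):
    # Log what the model was trained with, so the run is auditable later.
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("epochs", 20)
    mlflow.log_param("training_dataset", "s3://datasets/hard-hats/2019-06-20/")

    # ... training would happen here ...
    validation_accuracy = 0.91  # placeholder result

    mlflow.log_metric("val_accuracy", validation_accuracy)
    # Artifacts (model binaries, plots) can be attached to the run as well:
    # mlflow.log_artifact("model.pkl")
```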
Kubeflow is a wider project; it started as a way to deploy TensorFlow on Kubernetes. Obviously, Kubeflow is driving users to the Kubernetes ecosystem and trying to own that territory of machine learning on Kubernetes. I think the main value-add product they have released recently is Kubeflow Pipelines. It's basically an abstraction, a Python DSL, on top of Argo pipelines. Argo is a Kubernetes-native GitOps tool. There are a lot of buzzwords here, but it's all new stuff that the community is learning now. At a very high level, you can think of it as a new, Kubernetes-native Jenkins for defining continuous delivery and deployment pipelines. I will not go into details, and of course some DevOps people, infrastructure people, and Kubernetes folks will disagree with me, but there is something to discuss there.
Kubeflow Pipelines is a Python DSL on top of that. It is still not as mature as Jenkins or Airflow, for instance, and it doesn't yet add much particular value to the machine learning space. It is supposed to have machine learning specifics, but it has little at this moment. You can visualize your steps' outputs in a way that is well received in the data science community: for instance, you can visualize precision and recall, or the confusion matrix of a particular training step, and have it as an output.
You can have a TensorBoard attached to the logs of your training pipeline, so there is nice integration with those tools. But still, some experienced users who have worked with Airflow-type tools or Jenkins-type CI/CD tools will find it to be a very new project. I think if you look at their roadmap, they will be adding really cool features that are machine learning specific and machine learning related.
So stay tuned. But at this moment it's more in the experimentation phase: you still write a lot of boilerplate code to define Kubeflow pipelines, pass parameters in, and parse the outputs of the pipeline steps. Each step in the Kubeflow and Kubernetes world is a separate Docker container, and defining those steps carries a usability overhead right now. But overall, I think it forces users to do things the right way. By the time the Kubeflow team adds a more user-friendly DSL, more user-friendly abstractions, and more features, it will become more useful for users.
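To give a feel for the per-step boilerplate mentioned above, here is a rough sketch using the v1-era kfp Python DSL; the images, commands, and file paths are entirely hypothetical:

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="train-and-validate", description="Toy two-step training pipeline")
def train_pipeline(dataset_path: str = "s3://datasets/mnist/2019-06-20/"):
    # Each step runs as its own Docker container.
    train = dsl.ContainerOp(
        name="train",
        image="registry.example.com/trainer:latest",
        command=["python", "train.py"],
        arguments=["--data", dataset_path, "--out", "/out/model"],
        file_outputs={"model": "/out/model/path.txt"},
    )
    dsl.ContainerOp(
        name="validate",
        image="registry.example.com/validator:latest",
        command=["python", "validate.py"],
        arguments=["--model", train.outputs["model"]],
    )

if __name__ == "__main__":
    # Compile to an Argo workflow definition that can be uploaded to Kubeflow Pipelines.
    kfp.compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")
```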
And yes, we are really looking at this evolving ecosystem and building integrations with Kubeflow and MLflow, and also with proprietary cloud services on Azure and AWS, like SageMaker, etcetera. The ecosystem is of course evolving; we're part of it and trying to stay as close as possible to this community.
[00:45:34] Unknown:
And as the availability of different tooling and the overall level of understanding and sophistication of practitioners in the field has grown in the past few years, I'm wondering how that has influenced the overall design and architecture of Hydrosphere, some of the types of tooling or platforms that are often used in conjunction with Hydrosphere, and how that might break down across the different roles on a data team, as far as data engineer versus data scientist or machine learning engineer?
[00:46:07] Unknown:
Yes. Our own high-level architecture has not changed from the very beginning. As I mentioned, we replaced Kafka Streams with Akka Streams. We had been using Envoy and Istio heavily for traffic management and traffic routing between models and other microservices, but we recently switched to another implementation with our own add-ons to that process, because Envoy is not designed to handle machine-learning-specific routing. For instance, you need A/B testing; we call it response-based A/B testing.
With response-based A/B testing, you route 100% of the traffic to model A and 100% of the traffic to model B, and then you have a kind of controller that decides which response, A's or B's, to return to the user. So it's response-based routing. There are some other interesting features that are really required in machine learning, so we switched to our own implementation for those. We even started contributing to Envoy, but then we saw that it was too far from what we need, and we do not have much influence in that community, so it was easier for us to just build the features ourselves.
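A minimal sketch of that response-based A/B pattern, assuming two stubbed model endpoints and a simple random selection policy (none of which reflect Hydrosphere's actual implementation), could look like this:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def call_model_a(request: dict) -> dict:
    return {"model": "A", "label": "hard_hat", "confidence": 0.92}  # stubbed response

def call_model_b(request: dict) -> dict:
    return {"model": "B", "label": "hard_hat", "confidence": 0.88}  # stubbed response

def predict_with_ab(request: dict, b_share: float = 0.1) -> dict:
    """Send the request to both models, then decide whose response the caller sees."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(call_model_a, request)
        future_b = pool.submit(call_model_b, request)
        resp_a, resp_b = future_a.result(), future_b.result()

    # Both responses would be recorded for offline comparison; here we only pick
    # which one the caller sees (10% of callers get model B in this sketch).
    return resp_b if random.random() < b_share else resp_a

print(predict_with_ab({"image_id": "r-102"}))
```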
In terms of other tools, the ecosystem is of course evolving. The cloud providers are part of the innovation: Azure is doing a great job at simplifying the deployment and serving of machine learning models, and they also have some monitoring features that might be comparable to what we offer, to some extent. SageMaker is a bit behind on that monitoring side. Obviously they have thought about it, but to the best of my knowledge it's not on their roadmap at this moment.
Beyond the cloud providers, the open source ecosystem is also cool. Dataset versioning tooling is quite popular at the moment, and there are interesting discussions in the community. Namely, we discussed integrating with DVC, the dataset versioning tool, but then we decided that if you architect your data platform in the right way, you have immutable datasets, and all the versioning of datasets can be done just by having immutability in your data. A query against an immutable dataset should always return the same result over time, so that query is effectively a version of your dataset. It can be done without any additional tooling, and we now use that approach extensively on our own projects and implementations.
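A small sketch of that versioning-by-immutability idea: if the underlying data is append-only and never rewritten, a query plus a snapshot boundary can serve as the dataset version. The table, column names, and hashing scheme below are invented for illustration:

```python
import hashlib

def dataset_version(query: str, snapshot_boundary: str) -> str:
    """Derive a stable version id from the query text and an immutable snapshot bound."""
    fingerprint = f"{query.strip()}|{snapshot_boundary}"
    return hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()[:12]

TRAINING_QUERY = """
    SELECT image_path, label
    FROM safety_events
    WHERE event_time < :snapshot_boundary  -- only rows that existed at the cut-off
"""

# Re-running this query against append-only data always returns the same rows,
# so the pair (query, boundary) can be recorded as the training dataset version.
version_id = dataset_version(TRAINING_QUERY, "2019-06-20T00:00:00Z")
print(f"training dataset version: {version_id}")
```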
It's also an interesting topic to discuss, how to version your datasets. The takeaway: yes, there are Git-like tools for versioning datasets, but first of all, you can do it without a special tool, just through the proper architecture and design of your datasets and databases, by having your training datasets immutable by design. And if you still need it, there are great tools you can utilize to capture all that interesting metadata and store it alongside your dataset. And
[00:50:58] Unknown:
when you first began working on Hydrosphere and got further into the problem domain, I'm wondering if you can talk through what your original assumptions were going into the project, and how they have been challenged and updated in the process of building and growing the platform and getting more involved in the machine learning ecosystem?
[00:51:19] Unknown:
Yeah, of course. As I mentioned, from the very beginning our design and our main value proposition were based on our own experience. It wasn't purely academic research, of course; it was real-life integrations. But the major thing for us is feedback from users. For instance, if we talk to customers and they do not select our platform for their business for a particular reason, because we're missing some features or some integrations, that basically drives all of our development, product, and marketing efforts.
What we've learned is that auditability and observability of machine learning predictions, and auditability of machine learning pipelines, including the training pipeline and the serving pipeline, are must-haves. That's a feature we hadn't thought about from the very beginning, but we have added it recently. Also, the technology is evolving, with explainability of machine learning models, new tools and new research, and adversarial networks. When we started, generative adversarial networks didn't exist. So we're basically scraping the latest and greatest from the research and applying whatever fits our main goal and main value proposition.
For instance, we do have monitoring models that watch for things like data drift, and they are neural networks, so obviously it's challenging to explain why a particular request has been highlighted as an anomaly or an edge case. We added an explainability feature so that we explain our own predictions and our own judgments, and can of course explain them to the end user: hey, you have a concept drift, and the reason is that this particular feature has drifted in that direction.
That's also cool, and it really helps with our main mission and vision: to provide an instant feedback loop for data scientists and machine learning engineers, so they can iterate quickly in production, not just in research.
[00:54:37] Unknown:
And in terms of the future of Hydrosphere and your experience in the space so far, I'm wondering what you have planned for the platform, and anything that you're keeping a close eye on in terms of ongoing developments within the community and the overall
[00:54:57] Unknown:
ecosystem? Yeah. Our main goal for the next quarter is integrations, integrations, integrations. We're a very feature-rich platform, with a lot of product features. What we're missing currently is integrations with the cloud providers, with the OpenShift ecosystem, with Kubeflow, and all of that. And it includes not just the development of those integrations; it also includes documentation, tutorials, and blog posts for them. So that's our main go-to-market
[00:55:48] Unknown:
strategy at this moment. And are there any other aspects of the Hydrosphere platform, or of the overall concerns of managing machine learning projects in production and tracking and maintaining their health, that we didn't discuss yet that you'd like to cover before we close out the show? I don't think so. We have covered pretty much everything that I had in mind.
[00:56:11] Unknown:
We could talk about other topics, but I'm not sure they would be relevant for this show. I have a lot of thoughts on streaming, streaming versus batch and all of that ecosystem, Kafka and so on, but it's not really relevant to machine learning management. Still, I would be happy to elaborate on that
[00:56:37] Unknown:
as well. Okay. Yeah, that sounds like it could probably be a whole other episode on its own, so I'll have to have you back on to dig deep into that whole problem space and your thoughts there. For this topic, I suppose we'll call it a win. And so for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:11] Unknown:
Well, it's a pretty wide question. Broadly, everything that is related to production. I think we have pretty decent tooling for research, but even for research, when your machine learning model is being trained overnight, or for a couple of hours, you would benefit from a training-tracking interface with more precise and richer metric information, one that could even make decisions based on your training performance and maybe stop it or retune it. That's a big topic, and there are cool projects and vendors looking in that direction, so all of that meta-learning and automated machine learning with meta-learning is cool stuff. Even when we train our own machine learning models, we usually try to do it in an automated way. For instance, when you deploy your main model, we train some complementary models that will monitor it, and it's actually very challenging to do that in a fully autonomous way, where you just supply the data, features are engineered automatically, and the best model is basically chosen for you. You should really be able to put your training pipeline on autopilot.
There are leaders in that space, but I would like to see more tooling that augments the training pipeline and makes it not fully autonomous, but keeps the data scientist and the machine learning engineer in the loop by providing really good dashboards and metrics, TensorBoard-like, but with more details and more information to make a decision, and also supporting other frameworks like PyTorch, Keras, and others.
That's one direction. Another direction is managing the enterprise workflow of accepting and approving models to go to production. We all like the agile style of deploying things to production and managing them by engineers or machine learning engineers, but of course the enterprise workflow is much more complicated. You can have user testing, acceptance testing, and basically verification, probably by a management team or by business users, that this model is ready for production. What I see is that most enterprises really require this.
This could be done with classical software engineering tools, CI/CD tools like Jenkins and so on, but obviously machine learning will have its own specifics and its own metrics to be approved by business users. And you see, this already goes up to the business users, not to the machine learning engineers' level. So in some enterprises, every machine learning model will need to be approved by business users and also to be compliant with all the regulations, and so on.
[01:01:54] Unknown:
Alright. Well, thank you very much for taking the time today to share your experience working on Hydrosphere and the overall state of affairs for managing machine learning in production. It's definitely a very complex topic, and one that is continuing to evolve as more people get to the point where they're leveraging these capabilities in their products. So thank you for your time and perspective and for your work on Hydrosphere, and I hope you enjoy the rest of your day.
[01:02:26] Unknown:
Yeah. Thank you for the thoughtful questions. Have a nice day. Take care. Bye.
Introduction to Hydrosphere with Stepan Pushkarev
Stepan Pushkarev's Background and Journey into Data Management
The Origin and Evolution of Hydrosphere
Challenges in Managing Machine Learning Models
Hydrosphere's Architecture and Components
Monitoring and Metrics for Model Health
Factors Contributing to Model Drift
Regulatory Compliance and Traceability with Hydrosphere
Evolution of the Machine Learning Ecosystem
Impact of Ecosystem Evolution on Hydrosphere's Design
Lessons Learned and User Feedback
Future Plans for Hydrosphere and Ecosystem Developments
Enterprise Workflow and Model Approval