In this crossover episode of the AI Engineering Podcast, host Tobias Macey interviews Brijesh Tripathi, CEO of Flex AI, about revolutionizing AI engineering by removing DevOps burdens through "workload as a service". Brijesh shares his expertise from leading AI/HPC architecture at Intel and deploying supercomputers like Aurora, highlighting how access friction and idle infrastructure slow progress. Join them as they discuss Flex AI's innovative approach to simplifying heterogeneous compute, standardizing on consistent Kubernetes layers, and abstracting inference across various accelerators, allowing teams to iterate faster without wrestling with drivers, libraries, or cloud-by-cloud differences. Brijesh also shares insights into Flex AI's strategies for lifting utilization, protecting real-time workloads, and spanning the full lifecycle from fine-tuning to autoscaled inference, all while keeping complexity at bay.
Pre-amble
I hope you enjoy this cross-over episode of the AI Engineering Podcast, another show that I run to act as your guide to the fast-moving world of building scalable and maintainable AI systems. As generative AI models have grown more powerful and are being applied to a broader range of use cases, the lines between data and AI engineering are becoming increasingly blurry. The responsibilities of data teams are being extended into the realm of context engineering, as well as designing and supporting new infrastructure elements that serve the needs of agentic applications. This episode is an example of the types of work that are not easily categorized into one or the other camp.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Brijesh Tripathi about FlexAI, a platform offering a service-oriented abstraction for AI workloads
- Introduction
- How did you get involved in machine learning?
- Can you describe what FlexAI is and the story behind it?
- What are some examples of the ways that infrastructure challenges contribute to friction in developing and operating AI applications?
- How do those challenges contribute to issues when scaling new applications/businesses that are founded on AI?
- There are numerous managed services and deployable operational elements for operationalizing AI systems. What are some of the main pitfalls that teams need to be aware of when determining how much of that infrastructure to own themselves?
- Orchestration is a key element of managing the data and model lifecycles of these applications. How does your approach of "workload as a service" help to mitigate some of the complexities in the overall maintenance of that workload?
- Can you describe the design and architecture of the FlexAI platform?
- How has the implementation evolved from when you first started working on it?
- For someone who is going to build on top of FlexAI, what are the primary interfaces and concepts that they need to be aware of?
- Can you describe the workflow of going from problem to deployment for an AI workload using FlexAI?
- One of the perennial challenges of making a well-integrated platform is that there are inevitably pre-existing workloads that don't map cleanly onto the assumptions of the vendor. What are the affordances and escape hatches that you have built in to allow partial/incremental adoption of your service?
- What are the elements of AI workloads and applications that you are explicitly not trying to solve for?
- What are the most interesting, innovative, or unexpected ways that you have seen FlexAI used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on FlexAI?
- When is FlexAI the wrong choice?
- What do you have planned for the future of FlexAI?
Parting Question
- From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?
- Flex AI
- Aurora Super Computer
- CoreWeave
- Kubernetes
- CUDA
- ROCm
- Tensor Processing Unit (TPU)
- PyTorch
- Triton
- Trainium
- ASIC == Application Specific Integrated Circuit
- SOC == System On a Chip
- Lovable
- FlexAI Blueprints
- Tenstorrent
Hope you enjoy this crossover episode of the AI Engineering podcast, which is another show that I run to act as your guide to the fast moving world of building scalable and maintainable AI systems. As generative AI models have grown more powerful and are being applied to a broader range of use cases, the lines between data and AI engineering are becoming increasingly blurry. The responsibilities of data teams are being extended into the realm of context engineering as well as designing and supporting new infrastructure elements that serve the needs of agentic applications. This episode is an example of the types of work that are not easily categorized into one or the other camp.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed: flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming: Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect. Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? Datafold's Migration Agent is the only AI-powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems.
Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multi-system migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months-long migration nightmares into week-long success stories. Your host is Tobias Macey, and today I'm interviewing Brijesh Tripathi about Flex AI, a platform offering a service-oriented abstraction for AI workloads. So, Brijesh, can you start by introducing yourself?
[00:02:39] Brijesh Tripathi:
Yeah. Hi. My name is Brijesh Tripathi. I'm the CEO of Flex AI. I grew up in India, did my undergrad at IIT Kanpur, and then moved to the US and did my master's at Stanford. From there on, I've been lucky to be part of some of the most amazing companies in the Bay Area. I worked at Nvidia, Apple, Tesla, and Intel. And what I realized during my career is that I really enjoy bringing people and technologies together, and I really like solving tough problems. So a couple of years ago, I was working on an interesting project, which was one of the largest supercomputers in the world at that time, Aurora, that was handed over to Argonne National Lab. I realized that there is something missing in the current offerings at the infrastructure level, and that is ease of access.
We got the systems installed, and it took us a very long time to get them handed over to some of our developers and scientists who were actually solving real-world problems like climate change and medicine research. And from there, I got the idea that there is a layer missing in infrastructure management that simplifies access to compute. So that's what Flex AI is all about.
[00:03:57] Tobias Macey:
And do you remember how you first got started working in the ML and AI space?
[00:04:02] Brijesh Tripathi:
Yeah. I think it's a long story again, but it goes all the way back to really dealing with the complexities of AI. Before that, I was assigned the role of chief architect for AI and HPC architecture at Intel. The first task I had was to figure out what the right architecture for AI and ML workloads is. And I observed that a lot of the new architectures coming out had one deficiency: they all focused on the highest performance, yet they were all becoming very inefficient. And while I was trying to figure all of that out, I was also realizing that we build this massive amount of infrastructure, and a lot of it actually sits idle. Combining the two things, I realized that we need to change the fundamental architecture of how we design these things and focus on efficiency, energy efficiency, and then change the way we deliver this compute to our end users. And that led to an infrastructure platform management layer, which is what we are building at Flex AI.
[00:05:10] Tobias Macey:
And so digging a bit more into Flex AI and what you're offering, I'm also interested in some of the ways that you're seeing those infrastructure challenges contribute to the friction that's involved in developing and operating these very compute heavy AI applications that have become very popular and prolific over the past couple of years?
[00:05:34] Brijesh Tripathi:
The ChatGPT era basically changed a lot of the thought process on how small teams can actually be massively powerful and very efficient. So now you're starting to see one- or two-person companies that are actually building really good products, solving real problems. They actually are onto something. They have access to powerful models, they're working with very smart people, and they have this urgency to move fast. They want to iterate fast. But the challenge is that they all run into setting up infrastructure, maintaining infrastructure, and dealing with failures that happen left and right. So instead of solving real problems, these teams actually become DevOps experts and start dealing with infrastructure.
That is slowing them down, and that's actually one of the challenges that we are trying to solve here. Now the other thing that's also interesting is that when you're dealing with the current infrastructure and your access is through renting GPUs from either hyperscalers or these large new clouds like CoreWeave, you have to deal with unpredictable cost. And if you're trying to solve a problem that requires you to have a GPU for, let's say, a few hours, but there is no access to a GPU for just a few hours, you have to go block it for a week or a month, and that spikes the bill to a point where you can run out of money just renting GPUs before you even finish building what you're building. And finally, as people rent GPUs, they go and use infrastructure.
Different places will end up requiring different setups. So if you get compute from Azure, it will have a very different setup. If you get it from Amazon, it will be a very different setup. Google, same thing. And then these new clouds will offer various different settings like virtual machines or bare metal. All of that requires these very small teams, who are actually not infrastructure experts, to go build the infrastructure layer on top of what they're getting. That takes a toll on these people. They basically are spending less time building and solving their real problems, and they end up spending a lot of time just maintaining and managing these infrastructures. So, you know, as a founder, I actually feel for them. I want to have a very small team, a scrappy team, and I empathize with these people. So sometimes when I deal with these small groups of people, small teams, small startups, and they realize the value of what Flex AI is doing for them, that makes my heart happy. I'm really excited about helping some of these founders, my peers, who are trying to solve some real problems, get rid of the infrastructure problems they face every day.
[00:08:22] Tobias Macey:
Particularly with a focus on those founders and projects that are in the early stages of proof of concept or trying to figure out the appropriate product market fit, what are some of the ways that that friction and the cost unpredictability at the infrastructure level can prevent them from being able to scale at the pace that they would otherwise be able to if that cost was not a factor?
[00:08:50] Brijesh Tripathi:
Yeah. I mean, look. The iteration loop, the number of times you can actually go experiment with your model and improve your model, that's the key there. And if you're dealing with infrastructure challenges, that means you're not iterating fast enough. That slows you down. If you slow down that pace, someone else can catch up and actually do something better than you. So losing your edge by reducing the number of iterations or making the loop longer is how they are struggling. And one of the things that Flex AI does is take that friction away from these startups. You are just focused on your model iteration or your application iteration or your agent iteration, and the rest of it is taken care of by Flex AI, and that means you're going much faster than you would without us.
[00:09:36] Tobias Macey:
And to be a little bit hyperbolic on the infrastructure piece, isn't that what Kubernetes was supposed to solve for us? You could just write your code and deploy it into Kubernetes, and it would run everywhere that you had a Kubernetes cluster. Right?
[00:09:49] Brijesh Tripathi:
So that's ideally true, if it actually worked; that was the promise made. But, unfortunately, the dependencies, the libraries, the overall complexity of dealing with n different Kubernetes implementations: not everybody offers the exact same abstraction. It just doesn't work the way people expected. In fact, that is actually one of the premises we have, which is that we start from a Kubernetes layer that is now consistent. So our developers don't have to deal with the differences and the complexities of missing libraries or missing dependencies. It just works. But, fundamentally, you're right. It was supposed to solve all of this. It doesn't. In fact, one of the things that we are doing, Tobias, is that in the near future we are going to offer our customers the ability to bring their own containers. And those containers are then able to run in multiple places. That becomes a way for us to handle custom flows and flows that are not currently supported by our platform. However, it still requires a very stable underlying management layer, an underlying orchestration layer, so we can get better utilization. You know, if you get Kubernetes to work, it doesn't mean that you are now accessing resources that are dedicated to you. It means that you have to carve those things out, and that means you're still dealing with a bunch of idle cycles and wasted resources.
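To make the "consistent Kubernetes layer" point concrete, here is a minimal sketch, using the official Kubernetes Python client, of what submitting a single-GPU job looks like on a plain cluster. The image name and namespace are hypothetical placeholders, and the spec assumes the NVIDIA device plugin and matching drivers are already installed on the nodes, which is exactly the per-provider setup being described above; it is not FlexAI's implementation.

```python
# Minimal sketch (not FlexAI's layer): a single-GPU Job submitted through the
# official Kubernetes Python client. It only works if the cluster already has
# matching drivers and the NVIDIA device plugin, which differs per provider.
from kubernetes import client, config

config.load_kube_config()  # assumes a configured kubeconfig/context

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="finetune-demo"),
    spec=client.V1JobSpec(
        backoff_limit=0,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="example.registry/finetune:latest",  # hypothetical image
                        command=["python", "train.py"],
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"}  # requires the NVIDIA device plugin
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```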
[00:11:18] Tobias Macey:
One of the aspects of running, in particular, these very compute-heavy and GPU-heavy workloads in a Kubernetes environment is that with CPUs, as long as they're the same architecture, it doesn't matter if you're running Intel or AMD. But if you are in the GPU space and you're running between an NVIDIA GPU or an AMD GPU or a tensor processing unit, you need to have different drivers in place, which then also brings in questions around the software support in the code that you're writing, whether you're using CUDA or ROCm or some other library that is able to take advantage of that compute accelerator.
And that also brings in another layer of complexity, maintenance, and management at the software level that is directly tied to the underlying hardware that you're working against. I'm wondering how you're seeing teams address some of that complexity, some of the ways that you're thinking about that problem, and whether you're offering some sort of hardware/software abstraction layer to paper over some of those differences. And that also brings up the question of the potential loss of efficiency when you do have that translation layer, because you no longer have direct throughput to the underlying hardware once you have to do that translation.
[00:12:39] Brijesh Tripathi:
Yeah. That's an amazing question, by the way, and a very long one, so the answer is going to be long. But that's really the crux of what holds back a lot of alternatives compared to what NVIDIA has to offer. So, you know, if you are writing a model, a lot of times you will end up having very explicit CUDA calls in there that now target a specific NVIDIA architecture. And in fact, the way it is implemented, if you're running on A100 and you move to H100, there are no automatic gains that come from a new architecture even on NVIDIA. It's so specific to a particular hardware model number that, ignoring some small updates, every generation requires a bunch of rewriting and reworking of the models, and that is within the same vendor. NVIDIA and AMD definitely don't work the same way. We have to deal with ROCm versus CUDA. What we are trying to do is actually take away that complexity from developers. And the market is actually kind of helping in that sense. We are focusing on PyTorch. PyTorch is an abstraction layer above CUDA, and you've seen the Triton layer being introduced in the PyTorch world, which is trying to create a unified programming model for multiple GPUs, with both AMD and NVIDIA supported there. But at our level, we are actually staying with CUDA on NVIDIA, and we have a code analyzer that will look at what your CUDA calls are and suggest replacing them with more universal code that can also then run on AMD. In some cases, we will even offer our developers, our customers, support to port their code to different architectures.
Now these are things that are quite relevant for training. And as the market is moving more towards inference, we have made a decision that all of our training currently runs on NVIDIA, so we focus on NVIDIA only for training. However, as we move towards inference, we are moving away from architecture-specific optimization, and most of these inference endpoints are being called with the standard OpenAI API. Once we go to that level of abstraction, the implementation details per architecture are actually controlled by us. The developer or the user doesn't have to know which architecture it's going to run on as long as we deliver the same API and the same results, and that's helping a lot. In the next couple of years, we believe that a pretty big majority of AI workloads is going to be inference, and hence we are seeing a lot more architectures at play here. In fact, we are already deploying AMD for inference and Tenstorrent for inference. And in the future, we are also considering looking at Trainium and a couple of other new architectures for delivering inference.
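As an illustration of the portability issue described above, the sketch below shows the kind of device-agnostic rewrite that avoids hard-coded CUDA calls in PyTorch. ROCm builds of PyTorch expose the same torch.cuda API, so selecting a device at runtime lets one script run on NVIDIA, AMD, or CPU. This is illustrative only, under those assumptions, and is not the output of FlexAI's code analyzer.

```python
# Illustrative only: making model code portable across accelerators by avoiding
# hard-coded .cuda() calls. ROCm builds of PyTorch also expose torch.cuda, so
# the same script runs on NVIDIA, AMD, or CPU without edits.
import torch

# Before (tied to one runtime):
#   model = MyModel().cuda()
#   batch = batch.cuda()

def pick_device() -> torch.device:
    """Choose the best available accelerator at runtime."""
    if torch.cuda.is_available():      # true on both CUDA and ROCm builds
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device()
model = torch.nn.Linear(512, 512).to(device)   # stand-in for a real model
batch = torch.randn(8, 512, device=device)
out = model(batch)
print(out.shape, device)
```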
[00:15:48] Tobias Macey:
The workload distribution was the next thing I was going to ask about, and you preempted my question a little by stating that you're focusing on inference. But to your point, inference is not the entirety of AI. It's just the thing that is gaining all of the attention right now. And so, obviously, there's the serving of a given model, whether that's a generative AI model focused on language outputs or a generative model focused on vision outputs with a certain architecture, although even that is in flux right now. But regardless of whether it's a generative model or a predictive model using something like a random forest or an XGBoost, the training time compared to the serving time is going to be orders of magnitude different.
And so I'm curious how you think about that aspect of the platform that you're offering, in terms of being able to optimize for training time and iteration speed to manage the experimentation loop, and then the deployment and serving of those models, where even if you are using a foundation model off the shelf, there are often cases where you want to do some sort of post-training and fine-tuning of that model before you actually start serving it. And so I'm just wondering if you could talk to some of the ways that you're thinking about the overall AI life cycle, as well as the expansion beyond the transformer attention-based large language models that have gained all the popularity, especially as we start to look into things like diffusion-based models and Mamba architectures, etcetera?
[00:17:25] Brijesh Tripathi:
Yeah. That's, again, a really good and relevant question for the times. A couple of things to start with. Number one, I believe that the world is going to be a heterogeneous compute world in the future instead of a homogeneous, single-architecture one. And the way I think about this, and maybe it's a simplifying thing for myself and our implementation right now, is that every stage of this development, starting from pretraining to training to fine-tuning to implementing a deployment model, whether it's RAG or MCPs or some other approach, is going to need or benefit from having a somewhat customized and specialized architecture.
And for us, for Flex AI, our offering is going to make sure that the workflow doesn't really have to fundamentally change because we are switching which architecture is being used at which stage. So what we are doing right now for pretraining and training, and in fact even fine-tuning at some level depending on the size of the cluster that we need to put in there, is we focus on NVIDIA, because a lot of these models have been developed and are being developed with NVIDIA. So it's literally just too hard to start porting all of those models and giving a new architecture multiple years of time to develop, optimize, and become competitive compared to what NVIDIA has to offer. So, to simplify the world, we have multiple things that we do on the pretraining and training side that still improve access and the quality of training.
We take care of failures automatically. We call it self-healing. We take care of replacements automatically. So if one node fails, we basically automatically replace it with another. We take care of scale. We take care of seamless checkpointing in the background without having to worry about wasting cycles. All of those things are actually already optimizing the run time, the quality of the model, and the iteration cycles being developed. But as we move towards deploying and serving these models, we start getting into multiple architectures and multiple options, some of which we believe are better than what would be possible if we stayed with just one architecture or one company. So we just take a look at every stage of the problem, every stage of the workflow, and pick the right answer for the right workload.
And the end result is that you get the best of all worlds, not just two worlds. We are even able to put ASICs in there for certain use cases. So there are some customers who are looking for edge deployment, and the cost will not allow them to put this on NVIDIA or AMD. On the other hand, if you are deploying it on the edge, the performance and power constraints are actually so limiting that you have to serve it on a much smaller SoC or an ASIC, and we are even enabling that going forward. So, fundamentally, we believe that you should be able to select the right architecture for the right workload. And even within a workflow, you should be able to do that without worrying about having homogeneity across the board.
[00:20:48] Tobias Macey:
You mentioned workload as a service, which is something that is core to your offering and the framing that you're presenting as far as the overall capability. And I'm wondering if you can dig into that abstraction: how it simplifies the approach of going from "I have an idea" to "I have a deployed and operating system", how workload as a service helps in accelerating that path to deployment, and how it reduces some of these infrastructure concerns and the variance and rapid evolution of the ecosystem that we've been discussing.
[00:21:25] Brijesh Tripathi:
Absolutely. Yes. Look. We are evolving beyond GPU as a service. For the last couple of years, that was the buzzword: GPU as a service. What it was, plain and simple, was renting GPUs that only you have access to. You start by renting a GPU. You have to now go build a stack on top of it to make sure that it can run, whether you're trying to run training on it, pretraining, fine-tuning, or you are trying to deploy an inference server on it. And everything will require a very different setup. Everything will require you to go get the right libraries and get the right dependencies sorted out. In the end, generally speaking, it used to take teams days to weeks to set up new infrastructure to start a new workload, then tear it all down and put a new configuration on it to run a new set of workloads. And that was massively time consuming. So while you need to do that to be able to actually build what you are building, so you can deploy some of the interesting ideas that you come up with, you were wasting a ton of time just setting it all up. So what we are trying to do with workload as a service is: you worry about coming up with interesting ideas and deploying them, and we take care of the rest of it in the back-end infrastructure management. All you have to think about is what your idea is and what problems you're solving, not the maintenance and management and keeping track of all of the heterogeneous compute that might actually be useful or part of your workflow.
[00:23:04] Tobias Macey:
To your point of renting a GPU being the default method of gaining access to these compute accelerators, you then have to figure out: okay, either I find the smallest GPU I can get away with for my model, and then I have a very fixed ceiling in terms of what I'm able to do, or I say, I don't know exactly what I'm going to do, so I'm going to find the biggest GPU or cluster of GPUs that I can, and then I have to figure out how I'm actually going to maximize usage of it, where maybe I have a fleet of LLMs that I'm serving for a chat-based application, or maybe I'm building an agentic compute system. Or maybe I say, forget about renting GPUs, that's too much upfront capital, I'm just going to go the API approach and use a metered inference endpoint, and that has its own unpredictability in cost, potentially even more so. At least with the GPU, you know what the per-hour rate is and you can understand that, but maybe you're not going to have the flexibility to manage the capacity that you actually need. And I'm wondering if you could talk to some of the ways that you've developed your platform and your architecture to help smooth over some of those load patterns and get the sort of economies of scale, as well as some of the ways that the unpredictability of these AI workloads complicates the work that you are doing to manage that overall compute infrastructure and smooth out those load patterns across the fleet of GPUs that you are providing.
[00:24:36] Brijesh Tripathi:
Again, really good question. And, Tobias, we see this topic come up in a lot more customer conversations these days than actual access to compute. A lot of people are able to get compute these days, but the cost is becoming a real sore point. Recently, one of the favorite companies, Lovable, apparently showed up with $100,000,000 ARR, but the cost of delivering that $100,000,000 was almost $87,000,000 or something like that. It is just gigantic compared to what they are actually able to make out of these GPUs. So we definitely are building an architecture that allows people to smooth out their peaks and balance them against a little more averaged overall capacity. However, we also have the ability to give you more compute for lower cost when there is less demand. It's almost like how Uber does surge pricing based on what the demand is. If you go in the middle of the night, there are no drivers, so you just have to pay more. If you go during peak hours, where there's too much demand even though there are drivers, you pay higher prices, and then there are the average times. Companies that are trying to offer a sort of SLA for their customers end up choosing the highest level they ever expect to hit, which means most of the time they're sitting with idle infrastructure, and that's a huge cost. Companies that are trying to optimize for cost run into quality-of-service issues. We know that at peak hours, the responses for your chatbot or your queries are going to be hitting super high latencies.
What we provide is a combination of access across multiple clouds and multiple architectures. So let's just take the case of an enterprise that's trying to serve a chatbot or a support system using some model. They see these variations in their usage and their throughput, and they put some compute capacity on prem because that's a little more in their control, and they control the cost. But it is below the peak that they can actually hit. So what we do is, if we bring their compute under our platform, we also connect a whole bunch of capacity outside of their infrastructure that is now available to them instantly. The moment they get beyond their own capacity, we have agreements with cloud providers and with new clouds so that we can take that capacity and start running the workload from the customer.
And there is a small delta in what the cost is going to be, because it's on-demand capacity, but it's still lower than that company provisioning capacity that was going to meet that peak. On the other hand, say somebody puts in capacity that is significantly higher than what they normally need. What we then do is orchestrate various workloads using multitenancy and shared-GPU concepts, such that more than one customer, more than one user, can use the same infrastructure at the same time. And CPUs have been doing this for a while, right? Multiple virtual machines can run on a single machine until you run out of capacity, and then you autoscale a new node, a new server. GPUs weren't doing that, and we just started enabling that in our platform, where you can run a training workload and an inference workload side by side on the same infrastructure.
So given where the demand from inference is, the training can actually autoscale to a different size if there is a QoS there that allows it to be downgraded a bit. Right? If you're doing fine-tuning and you were letting it run for a few days or a week or so, slowing down during peak hours is not going to kill it. It's not going to affect your business too much. And that's a beautiful way of actually maximizing utilization of your existing capacity. So these are things that we are actually enabling right now. And then there's this last thing that we are adding, which is actually very exciting for us: we were able to put capacity, let's say on-prem capacity, on NVIDIA GPUs, but there is inference that we can then launch on a different architecture.
When you run out of your own capacity, we can just move that to a different architecture and scale it over there while minimizing your cost. These are things that were not possible before, but with our platform, we just seamlessly connect the users, the customer, and their workload scaling to multiple clouds and multiple architectures at any location.
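From the caller's side, the abstraction described here looks like an ordinary OpenAI-compatible endpoint. Below is a minimal consumer-side sketch, with a placeholder base URL, credential variable, and model name; the actual FlexAI endpoint details are not shown here and the specifics are assumptions for illustration.

```python
# Consumer-side sketch: the caller talks to an OpenAI-compatible endpoint and
# never sees which accelerator actually serves the request. The base URL,
# environment variable, and model name are hypothetical placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example.com/v1",   # hypothetical endpoint
    api_key=os.environ["INFERENCE_API_KEY"],       # hypothetical credential
)

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",                 # hypothetical model name
    messages=[{"role": "user", "content": "Summarize why idle GPUs are costly."}],
)
print(resp.choices[0].message.content)
```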
[00:29:30] Tobias Macey:
That speaks to the overall requirement for effective orchestration in any AI application, whether it's just being able to orchestrate the deployment of the model to the GPU, or being able to optimize for cost because the workload isn't interactive and user facing, it's a more long-running agentic workflow, and so you can schedule the execution of that agent to those cheaper, lower-traffic times. And I'm curious if you can talk to some of the ways that your workload-as-a-service architecture is incorporating these orchestration concepts to effectively manage the load distribution and identify workloads that are time sensitive or require interactivity versus the ones that are more asynchronous or long running and able to take advantage of these distributions of load?
[00:30:27] Brijesh Tripathi:
Yeah. Absolutely. And the concept is not new, Tobias. If you have studied computer architecture, you know there is a whole concept of scheduling where the entire purpose is to make sure every cycle of the CPU is busy. Why is that? Because that's the most expensive resource you have in the system. So your entire job in scheduling is to make sure every cycle is being used, and used for the right purposes. But there's always a priority in there. The number one priority is based on the actual sequence in the queue. Right? You pick the first one first. However, there are priorities attached to interrupts and high-priority, real-time tasks that get picked out of the queue and executed much faster, ahead of everything else. So we apply a very similar concept in our workload orchestration, and we let our customers define their priorities for various workloads. They can say this is real time, and that means no one can interrupt it, and it will actually get ahead of the queue as long as there is no other real-time task ahead of it.
They can set the priority to be: give me best effort. Best effort means I will find the cheapest, most inexpensive resource and put it in the queue behind everything else. It will run once every other real-time and high-priority task is completed. There are some tasks that are not real time, but they still have a priority on them because it affects the business objectives. For example, if you have a training run that finishes in one day, and running it over two days would actually mean business losses, then in those cases they don't get preempted, and they get similar priorities to real time, except that real time will always trump anything else because that is priority zero. Then there's priority one, then there's priority two. Depending on what the business objectives are, you can continue to set those things up. Now here's an interesting thing that we discovered and then implemented recently.
As most training jobs go, there's always a checkpoint happening. And when we were checkpointing, we also got an opportunity to move that workload to the side if there was something else to preempt it. Interestingly, the cost of checkpointing has gone down significantly in the last year. So now we do it seamlessly, without waiting, without pausing the workload. It happens in the background at pretty much zero cost to the workload. At every checkpoint, we get an opportunity to see if there's a need to preempt it, get a higher-priority, real-time workload executed, get out of the way, and then go back and resume the same long-running workload that we previously checkpointed. So these things are now becoming very effective in maximizing the utilization of the existing infrastructure and resources we have access to. And that's something that I'm actually seeing a lot of people talking about: oh, we used to have 30% utilization, and now we are pushing our utilization to almost 80%, which is pretty good. One anecdote on this: recently, a few weeks ago, we had what I call the magical point, which was proof of all of the things that we have been talking about. We had multiple customers on our platform.
We had multiple compute clusters from various providers on our platform, and we had multiple workloads on our platform. Different GPUs, all consumed. Everybody happily running and moving forward, and everybody combined using 100% of the compute capacity available. That was beautiful. I posted a picture of it on LinkedIn just because I was so happy about it.
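For readers who want the scheduling idea in code, here is a toy sketch of priority ordering with preemption only at checkpoint boundaries. The priority levels, job names, and step counts are invented for illustration, and this is in no way FlexAI's actual scheduler.

```python
# Toy sketch of the idea above: real-time work (priority 0) runs ahead of
# long-running jobs, which yield only at checkpoint boundaries and resume later.
import heapq
from dataclasses import dataclass, field
from itertools import count

_seq = count()  # FIFO tie-break within a priority level

@dataclass(order=True)
class Job:
    priority: int                       # 0 = real-time, 1 = business-critical, 2 = best-effort
    seq: int = field(default_factory=lambda: next(_seq))
    name: str = field(compare=False, default="")
    steps_left: int = field(compare=False, default=0)

def run(queue: list[Job], checkpoint_every: int = 2) -> None:
    heapq.heapify(queue)
    while queue:
        job = heapq.heappop(queue)
        # Run until done or until the next checkpoint, whichever comes first.
        chunk = min(job.steps_left, checkpoint_every)
        job.steps_left -= chunk
        print(f"ran {job.name} for {chunk} steps (priority {job.priority})")
        if job.steps_left > 0:
            # Checkpoint reached: requeue so any higher-priority arrival runs next.
            heapq.heappush(queue, job)

run([
    Job(priority=2, name="fine-tune", steps_left=6),
    Job(priority=0, name="chat-inference-burst", steps_left=1),
    Job(priority=1, name="nightly-eval", steps_left=2),
])
```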
[00:34:12] Tobias Macey:
And as you have been developing and iterating on this platform, the workload as a service, and onboarding customers, what are some of the ways that the overall scope and implementation of your system have evolved from when you first started thinking about it and building the initial versions?
[00:34:31] Brijesh Tripathi:
Yeah. Really good question. I think when we started on this journey, towards the beginning of 2024, large language models were still being built. There were many, many models being built at that time, but the pace has slowed down, and I expect a bit of consolidation as we move forward. The size of the models is growing pretty rapidly, so there are going to be people who just fall out and are not in the race anymore. But what is happening because of that is that the efficiency of large language model training is constantly improving, because it's now consolidated in one or two places, or maybe five companies working on very large language models, and they all spend an enormous amount of time optimizing infrastructure, the model architecture, the data flow, and failures. Everything is now under scrutiny, and everybody is looking for even a half-percent improvement in utilization and speed.
We didn't have that when we started. So in early 2024, we were still seeing 30% utilization of very, very expensive resources. I'm already starting to see some improvements on that utilization factor. However, when we started, we were thinking training was going to be a lot more ubiquitous than what it ended up being. So we didn't really spend a lot of time on inference and model serving. But as we spent time with our customers, we realized that model serving and inference actually run into the same inefficiency.
You know, as we were talking about, people either book capacity to hit the peak and then sit idle on it, or they're struggling with response times during peak hours. And we realized that this concept of capacity versus utilization is very similar for inference. That's one of the ways we have evolved from the time we started to now. We now pay a lot more attention to smaller workloads where the spikiness is pretty high, and we divide the workloads and run the agents in a way that autoscales, to make sure that we are maximizing the utilization of the capacity that we have. There's something else that we also realized, which is that the focus on multiple architectures wasn't there when it was just training that people cared about. When they were all building models, they all raised millions and millions and billions and billions of dollars and just focused on NVIDIA. But as they are moving towards inference, we are seeing a lot more opportunities for offering heterogeneous compute.
And now it's not just about offering an API on, you know, a Groq or a Cerebras. And, again, if you stay in that paradigm, you deal with all sorts of complexities with the actual platform. You will be on one architecture for training, then you move somewhere else, then you move somewhere else again. With Flex AI, we just remain consistent with the platform. So your data doesn't have to be copied over. You don't have to move to a different platform. Your experience remains the same. You start on one architecture, you end on another, but it feels like one platform. And that's actually one of the advantages that we are bringing: one platform, multiple workloads, multiple clouds, multiple locations, tailored and optimized for your needs.
[00:38:05] Tobias Macey:
And so for people who are onboarding onto the Flex AI platform, I'm curious if you can talk to some of the interfaces and abstractions and workflow that they should be aware of and just the overall process of using Flex AI for going from, I have an idea that I want to experiment with to I am now serving this at production load in an auto scaled capacity.
[00:38:31] Brijesh Tripathi:
Absolutely. The first thing is, my initial thought was it should be as simple as one or two clicks. Right? Make it a user interface, a GUI, that is super clean, so that anyone, even without an AI or ML background or infrastructure understanding, can come in and deploy something. Obviously, we work with a very technical group of people, so a lot of them already have workflows, and they run their own containers, their own virtual machines. We enable them through a CLI, a command-line interface. So we allow people to launch their workloads, whether it's training, fine-tuning, or inference, using a simple command line for any of the models you can pick from the open source world; point the platform to your data, and we take care of the rest. None of the setting up the infrastructure, defining the networking, or figuring out where to set up the storage and S3 buckets.
But we also simplified things for people who do not have that level of experience or haven't built AI applications before. We actually started something called blueprints recently, where if you are interested in building something like smart search, voice transcription, multi-cloud migration, or a media image playground, we enable that by giving people a blueprint of how to build these things. We pick some standard open source models from Hugging Face, we have some dummy data or reference data that we have taken from the open source world, and we build some of these applications for them to adapt to their own use case. So now they can pick a different model instead of the four models that we picked. They can bring their own data, which they can use for fine-tuning or RAG. And then, depending on what their application is, they can actually have multimodal input and output. And all of that is driven by a single platform where these blueprints are already available. You can just go in, click, change some things, and get exactly what you wanted.
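To illustrate the "swap the model, bring your own data" idea behind blueprints, here is a minimal sketch using the Hugging Face transformers library directly. The model name is a stand-in that any open-source checkpoint could replace, and the inline context string stands in for a user's own data; it is not FlexAI's blueprint code.

```python
# Hedged sketch of the blueprint idea: pick an open model, point it at your own
# data, and swap either piece later. Not FlexAI's implementation.
from transformers import pipeline

MODEL_NAME = "gpt2"  # placeholder; a blueprint would let you swap in another open model

generator = pipeline("text-generation", model=MODEL_NAME)

# "Your data" here is a single reference snippet; a real blueprint would wire in
# fine-tuning or retrieval over your own corpus instead.
context = "FlexAI schedules training and inference side by side to raise GPU utilization."
prompt = f"Context: {context}\nQuestion: Why does co-scheduling help?\nAnswer:"

print(generator(prompt, max_new_tokens=40)[0]["generated_text"])
```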
[00:40:37] Tobias Macey:
One of the perennial challenges for anybody who wants to build a platform as a service and provide a unified experience for their customers is that there are always going to be some edge cases or some assumptions baked into the platform that are contrary to the assumptions of the team who are building the solution. And so finding the appropriate abstractions and interfaces as well as providing escape hatches for those edge cases is a path towards success. And I'm wondering if you can just talk to some of the ways that you're thinking about addressing those affordances and escape hatches and the aspects of the overall AI deployment and development ecosystem that you're explicitly leaving out of scope in order to be able to focus on your core competency of infrastructure abstraction?
[00:41:27] Brijesh Tripathi:
Absolutely, Tobias. That's a great question. We realized very early on that the world is not going to be homogeneous, even from a philosophy perspective. Every customer that we talked to initially had a very different view on what their workflow would be, and we had to make some decisions, because you can't serve an infinite number of combinations and permutations. So we had to make some choices. We made some choices, and the objective was to serve a pretty large majority of the market. And one segment that we explicitly said we cannot, or shouldn't, be targeting is someone who is already a deep expert in infrastructure management. They were just about to turn into model people and try to build some models, but their background and history was infrastructure management. The amount of custom code and micro-optimizations they put in their flows is just impossible to manage and support, because these are sometimes their proprietary flows, and we don't want to get in the middle of their business. Having said that, we have now dealt with about fifty or sixty high-quality users who are all looking to build models, train, fine-tune, and build multimodal outputs.
And we can very proudly say that we now serve most of them, and we have roadmaps to serve the remaining ones with some of the new features that we are going to add. So, you talked about edge cases. We said there is going to be a catch-all solution, and that's going to be containers. Our solution is: you want to run fine-tuning or training or inference? There is a very simple CLI and a very simple GUI web interface that you can launch things with, and that should cover 90% of the world. For the other 10%, if they have some custom flows, then maybe, if it becomes important enough, we might actually make it part of our platform's managed services offering. But for now, we can help you package it in a container, and that container becomes the workload on our platform. It asks for the resources it needs, it runs in the right priority order, and then it gets out of the queue when it's done. That means anybody who doesn't fit the way we run our platform or provide the service can bring their own containers and make them part of our scheduling system. It still can leave gaps and holes, and we are saying that we are going to leave that 0.1% of the market because it's probably out of scope for us for now. I believe pretty much every platform has to make a choice like that, and no one can guarantee 100% coverage, and that's the choice we are making.
[00:44:13] Tobias Macey:
And in your experience of building Flex AI, getting it in front of customers, helping them use it to its fullest effect, what are some of the most interesting or innovative or unexpected ways that you've seen the platform applied?
[00:44:26] Brijesh Tripathi:
Yeah. There were many, many learnings. One of them was a customer who was trying to get their model trained and get ready for a YC demo. They were struggling to just get their infrastructure up and running and get the training going. They somehow heard about us and reached out, and we got them up and running within a couple of hours, and they were able to set everything up and go do their demos. There were a couple of other examples where we didn't initially realize that we could solve a specific customer's issue. This is one where somebody had a ton of data on one of the hyperscalers.
And they were looking for compute in a cheaper location with a new cloud. We were all connected to the same people, and I knew where the pricing was going to be. It was, like, a dollar fifty compared to $6 on a hyperscaler. So the math was obvious. But the data movement meant that every time, they had to pay the egress fees, which were adding up to tens of thousands of dollars because the amount of data was so large. Our architecture, even though we weren't designing it for this case, allowed them to save the egress fees on any iterations.
The first time, we put this in a cache, and the caching was done between cloud storage and on-node storage. But once that first step was done, there were no more egress fees. That saved a ton of money for them, and they said, we would actually go with this solution, because now we have access to all the compute in the world that is the right size, the right price, and the right capacity for us, and we don't have to worry about data movement or egress fees. There are many other examples that I'm definitely forgetting, but it just makes me happy to see some of the customers' use cases getting solved by the right thinking on our platform and the right solutions that we are building. And this is the reason why we started the company: pick a good problem to solve and solve it in a reasonable way, in a fast way.
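The egress-saving pattern described above boils down to caching the first download on node-local storage and reusing it on later iterations. A hedged sketch follows, with a hypothetical bucket, key, and cache directory, using boto3 with credentials assumed to be configured; the real FlexAI caching layer is certainly more involved.

```python
# Illustrative cache-or-reuse logic in the spirit of the egress-saving story;
# bucket, key, and cache directory are hypothetical placeholders.
from pathlib import Path
import boto3

CACHE_DIR = Path("/mnt/node-cache")  # hypothetical on-node storage

def fetch_dataset(bucket: str, key: str) -> Path:
    """Download an object once; later iterations reuse the on-node copy."""
    local_path = CACHE_DIR / key
    if local_path.exists():
        return local_path                      # cache hit: no egress charge this time
    local_path.parent.mkdir(parents=True, exist_ok=True)
    boto3.client("s3").download_file(bucket, key, str(local_path))  # first (and only) egress
    return local_path

# First call pays egress; every subsequent training iteration reads the cached copy.
path = fetch_dataset("example-training-data", "datasets/corpus.parquet")
print(path)
```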
[00:46:34] Tobias Macey:
And in your experience of building the business, investing in the technology and the abstractions and building the business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:46:49] Brijesh Tripathi:
The interesting one is something I talked about a little bit already. When we first thought that we should build a managed service platform where we could reduce the number of permutations and combinations people deal with in terms of doing training or fine-tuning, it was very clear that even if we were right, the world was going to be pretty fragmented in terms of how many things people do and how people run their workflows and training. So we had to adapt very quickly to supporting more than what we were initially planning to support, because it was more important for us to meet the customers where they were rather than stick with the idea that we had initially. That was a pretty big learning for us, and it happened very quickly. And I'm really glad that we were working with some of the best developers and users, who were very friendly, gave us support, and helped guide and nudge us in the right direction. That was very helpful. The second thing was, as we started deploying our first iteration of the product, we realized that people are not really excited about switching platforms. They don't like being on AWS for one thing, on GCP for another thing, and on new clouds and other small cloud providers for some other stuff, even if there is a cost advantage. And that is basically our point too, which is that everything needs a different setup, and nobody really has the patience or time to set up all these different environments and manage multi-cloud and multi-provider. So that translated for us as: we cannot offer just one piece of the equation of their workflows and say, use it, this is the best you can get, and then have them go somewhere else to finish their task. It needed to be end-to-end workflow support, and that's something that we are now enabling and getting tons of good feedback on. So you don't have to get out of our platform. As long as you're not developing some of the largest models in the world, like xAI, you are our prime target. If you develop models, if you fine-tune models, if you deploy agents, or if you have endpoints that you're deploying, everything is served in just one unified platform.
[00:49:03] Tobias Macey:
And you mentioned that on the high end, there's a certain level of scale at which it might not make sense to use Flex AI. I'm curious if there is a corresponding cohort on the other side of the spectrum, where maybe there's a point at which you're too small for Flex AI to be beneficial. I'm just wondering if you can talk to some of the ways that you think about your ideal customer and give some guidance for people who are evaluating Flex AI for their own use cases?
[00:49:30] Brijesh Tripathi:
Actually, the other end is zero. We are so supportive of small developers and startups that we have no minimum requirement on our platform. We offer on-demand pricing. We offer credits. We offer ways to minimize your cost. So if you're looking for one GPU's worth of compute, or even a fractional GPU's worth, because we now support fractional GPUs on our platform, this is the right platform for you. Yes, let's say beyond 10,000 GPUs, you probably shouldn't be using a platform, because you'd be trying to get every single ounce out of a very expensive cluster, and running a managed service on top of that is probably not the best use of the resources. And when you are at that capacity, you're probably also focused on a single workload. You're not doing multiple things simultaneously, because when people build models at that scale, they just build models for months. Below that, I will say Flex AI covers everything under the sky. Once you get on the platform, you don't have to leave it, whether you're building, optimizing, tuning, or deploying.
[00:50:39] Tobias Macey:
And as you continue to build and iterate on the Flex AI platform, what are some of the things you have planned for the near to medium term or any particular projects or problem areas you're excited to explore?
[00:50:51] Brijesh Tripathi:
Absolutely. Some of the things that we are very excited about are bringing new architectures onto the platform. We are working with TPUs. We are working with AMD. We're working with Tenstorrent. We will continue to work with more silicon providers. One of the exciting things that we are testing right now in our internal development is an inference autoscaler. We have developed a tool that is a simulator for the size or capacity of the cluster needed for a specific throughput required for a given model. We're now applying that to start autoscaling the capacity for a given user or a given workload. And that's going to be the most cost-optimized solution for people who have spiky workloads or spiky inference needs. It's going to deploy new nodes and new GPUs within seconds of seeing extra traffic or higher traffic come through. As we were talking about before, I'm also very excited about deploying more blueprints.
That's the ability to actually translate your vision and idea into a real product. I'm super impressed by Lovable. You have an idea for an app or a website, and you're able to implement it in less than an hour or a day. That's exactly what I want to do for AI applications. You have an idea or a concept, you come to our platform, and you should be able to just use our widgets and our platform and deploy it within hours or a day. That's very, very exciting. Now, all of that can be summarized into one thing. I would like my fellow startup founders to focus on what they started the company for and leave the DevOps and the infrastructure management to us, so they can get the most value out of their teams and the money they raised. We can help them by providing a seamless interface to clean infrastructure.
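As a rough illustration of the capacity-simulator idea behind the inference autoscaler mentioned above, the sketch below estimates a replica count from an observed request rate and a measured per-GPU throughput. All numbers and the headroom factor are hypothetical assumptions for illustration, not FlexAI's model.

```python
# Back-of-the-envelope sizing sketch: how many GPU replicas for a target
# throughput, plus safety headroom. All numbers are hypothetical.
import math

def replicas_needed(requests_per_sec: float,
                    tokens_per_request: float,
                    tokens_per_sec_per_gpu: float,
                    headroom: float = 0.2) -> int:
    """Estimate GPU replicas for a target throughput plus safety headroom."""
    required_tokens_per_sec = requests_per_sec * tokens_per_request * (1 + headroom)
    return max(1, math.ceil(required_tokens_per_sec / tokens_per_sec_per_gpu))

# Example: 40 req/s, ~350 generated tokens each, one GPU serving ~2,500 tokens/s.
print(replicas_needed(40, 350, 2500))   # -> 7 replicas under these assumptions
```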
[00:52:48] Tobias Macey:
Are there any other aspects of the work that you're doing at Flex AI or the overall complexities and points of friction related to the infrastructure for training and inference that we didn't discuss yet that you'd like to cover before we close out the show?
[00:53:03] Brijesh Tripathi:
I think we covered pretty much everything. My summary of what we are doing is: I want to make sure that people who are starting to develop new applications and come up with new ideas on using AI to solve real-world problems actually focus on that, and let us take care of the back end of it, the infrastructure side of it. The customers are recognizing our value. They are already using our platform and saving time. In my mind, the success metric for us is going to be when our users can say that they haven't dealt with infrastructure issues in however many months they have been working with Flex AI. That's going to be success for us.
[00:53:45] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today?
[00:54:01] Brijesh Tripathi:
I think this is a complex question with a very simple answer. My take is, if you are not using Flex AI, then you're actually building a team of very smart DevOps people, very smart DevOps engineers. And if that was the purpose of starting your business, starting your company, then you're doing the right thing. However, if you want to focus on the workloads and the problems that you're really solving, then your answer is going to be Flex AI. And the long answer is that every single infrastructure setup requires a very different set of expertise, knowledge, and pain point resolution.
A lot of people deal with just setting up the infrastructure, maintaining it, and keeping it up to date. I mean, which version of PyTorch and which version of CUDA should I actually use for this application? What is the model optimized for versus what quantization level do I have to set? Those are two different problems. If you're focused on quantization, just focus on that instead of dealing with which PyTorch version you should be using. And that's actually lacking in the industry. Most people are learning on the job, and we just want to stop that. We want people to spend their time developing their applications and let us take care of the infrastructure piece.
[00:55:19] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing on Flex AI and giving some insights into the complexities of dealing with hardware and optimizing throughput for managing these AI applications. I appreciate all the time and energy that you folks are putting into making that a solved problem for the rest of us, and I hope you enjoy the rest of your day. You do the best. Thank you. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and colleagues.
Crossover intro: AI Engineering meets Data Engineering
Show kickoff and sponsor break (skipped)
Guest intro: Brijesh Tripathi and the origins of Flex AI
From HPC to AI: efficiency, utilization, and a new platform layer
Startup pain: infra friction, costs, and multi‑cloud inconsistencies
Iteration speed as competitive edge and how Flex AI reduces friction
Does Kubernetes solve it? Where it falls short and Flex AI’s approach
GPU heterogeneity: CUDA vs ROCm, drivers, and abstraction choices
Training vs inference: heterogeneous compute across the AI lifecycle
Selecting the right architecture per stage, including edge/ASICs
From GPU-as-a-Service to Workload-as-a-Service
Smoothing costs and demand: capacity planning across clouds
Orchestration and scheduling: priorities, preemption, and checkpointing
How the platform evolved: from training focus to inference scale
Onboarding and interfaces: CLI, blueprints, and ease of use
Abstractions with escape hatches: BYO containers and scope
Customer stories: rapid YC prep and saving egress with caching
Building the business: lessons on fragmentation and end‑to‑end needs
Who Flex AI is for: ideal users from indie devs to large teams
Roadmap: new silicon, autoscaling inference, and app blueprints
Closing thoughts: let builders build; Flex AI handles infra
Wrap‑up and outro