Machine Learning

Bringing Automation To Data Labeling For Machine Learning With Watchful - Episode 316

Data engineers have typically left the process of data labeling to data scientists or other roles because of its nature as a manual and process heavy undertaking, focusing instead on building automation and repeatable systems. Watchful is a platform to make labeling a repeatable and scalable process that relies on codifying domain expertise. In this episode founder Shayan Mohanty explains how he and his team are bringing software best practices and automation to the world of machine learning data preparation and how it allows data engineers to be involved in the process.

Read More

Optimize Your Machine Learning Development And Serving With The Open Source Vector Database Milvus - Episode 313

The optimal format for storage and retrieval of data is dependent on how it is going to be used. For analytical systems there are decades of investment in data warehouses and various modeling techniques. For machine learning applications relational models require additional processing to be directly useful, which is why there has been a growth in the use of vector databases. These platforms store direct representations of the vector embeddings that machine learning models rely on for computing relevant predictions so that there is no additional processing required to go from input data to inference output. In this episode Frank Liu explains how the open source Milvus vector database is implemented to speed up machine learning development cycles, how to think about proper storage and scaling of these vectors, and how data engineering and machine learning teams can collaborate on the creation and maintenance of these data sets.

Read More

Declarative Machine Learning Without The Operational Overhead Using Continual - Episode 222

Building, scaling, and maintaining the operational components of a machine learning workflow are all hard problems. Add the work of creating the model itself, and it’s not surprising that a majority of companies that could greatly benefit from machine learning have yet to either put it into production or see the value. Tristan Zajonc recognized the complexity that acts as a barrier to adoption and created the Continual platform in response. In this episode he shares his perspective on the benefits of declarative machine learning workflows as a means of accelerating adoption in businesses that don’t have the time, money, or ambition to build everything from scratch. He also discusses the technical underpinnings of what he is building and how using the data warehouse as a shared resource drastically shortens the time required to see value. This is a fascinating episode and Tristan’s work at Continual is likely to be the catalyst for a new stage in the machine learning community.

Read More

Prepare Your Unstructured Data For Machine Learning And Computer Vision Without The Toil Using Activeloop - Episode 212

The vast majority of data tools and platforms that you hear about are designed for working with structured, text-based data. What do you do when you need to manage unstructured information, or build a computer vision model? Activeloop was created for exactly that purpose. In this episode Davit Buniatyan, founder and CEO of Activeloop, explains why he is spending his time and energy on building a platform to simplify the work of getting your unstructured data ready for machine learning. He discusses the inefficiencies that teams run into from having to reprocess data multiple times, his work on the open source Hub library to solve this problem for everyone, and his thoughts on the vast potential that exists for using computer vision to solve hard and meaningful problems.

Read More

Make Database Performance Optimization A Playful Experience With OtterTune - Episode 197

The database is the core of any system because it holds the data that drives your entire experience. We spend countless hours designing the data model, updating engine versions, and tuning performance. But how confident are you that you have configured it to be as performant as possible, given the dozens of parameters and how they interact with each other? Andy Pavlo researches autonomous database systems, and out of that research he created OtterTune to find the optimal set of parameters to use for your specific workload. In this episode he explains how the system works, the challenge of scaling it to work across different database engines, and his hopes for the future of database systems.

Read More

Accelerating Machine Learning Training And Delivery With In-Database ML - Episode 195

When you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? In this episode Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a way to speed up your experimentation, or an easy way to apply AutoML then this conversation is for you.

Read More

Easily Build Advanced Similarity Search With The Pinecone Vector Database - Episode 189

Machine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch Edo Liberty founded Pinecone to build database that works natively with vectors. In this episode he explains how this technology will allow teams to accelerate the speed of innovation, how vectors make it possible to build more advanced search functionality, and how Pinecone is architected. This is an interesting conversation about how reconsidering the architecture of your systems can unlock impressive new capabilities.

Read More

Moving Machine Learning Into The Data Pipeline at Cherre - Episode 181

Most of the time when you think about a data pipeline or ETL job what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy data for Addresses by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial strength solution for it, and how it actually works under the hood. After listening to this you’ll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process.

Read More

Leave Your Data Where It Is And Automate Feature Extraction With Molecula - Episode 175

A majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, managing schema evolution, building transformations to match the expectations of the destination systems. H.O. Maycotte was faced with these same challenges but at a massive scale, leading him to question if there is a better way. After tasking some of his top engineers to consider the problem in a new light they created the Pilosa engine. In this episode H.O. explains how using Pilosa as the core he built the Molecula platform to eliminate the need to copy data between systems in able to make it accessible for analytical and machine learning purposes. He also discusses the challenges that he faces in helping potential users and customers understand the shift in thinking that this creates, and how the system is architected to make it possible. This is a fascinating conversation about what the future looks like when you revisit your assumptions about how systems are designed.

Read More

Bridging The Gap Between Machine Learning And Operations At Iguazio - Episode 174

The process of building and deploying machine learning projects requires a staggering number of systems and stakeholders to work in concert. In this episode Yaron Haviv, co-founder of Iguazio, discusses the complexities inherent to the process, as well as how he has worked to democratize the technologies necessary to make machine learning operations maintainable.

Read More