Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. The Dat Project was created to provide a simpler way to distribute and version data sets among collaborators. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the team's plans for upcoming features and uses to watch for in future releases.
The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, those make up just a small percentage of the information that is available; the rest of a company's sources of knowledge are housed in so-called “Dark Data” sets. In this episode Alex Ratner explains how the work that he and his fellow researchers are doing on Snorkel can be used to extract value from that data by using labeling functions written by domain experts to generate training sets for machine learning models. He also explains how this approach can democratize machine learning by making it feasible for organizations with smaller data sets than those required by most tooling.
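
To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use the Snorkel library or its API, and it combines votes with a simple majority rule rather than whatever model Snorkel itself uses to weigh labeling functions. The ticket-classification task, the function names, and the keyword heuristics are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical label values for an illustrative "is this ticket a complaint?" task.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


# Each labeling function encodes one domain-expert heuristic.
# It either votes for a label or abstains when the heuristic does not apply.
def lf_mentions_refund(text):
    return POSITIVE if "refund" in text.lower() else ABSTAIN


def lf_says_thanks(text):
    return NEGATIVE if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return POSITIVE if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_refund, lf_says_thanks, lf_many_exclamations]


def majority_label(text, labeling_functions):
    """Combine votes by simple majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between labels, so no confident label
    return ranked[0][0]


def build_training_set(texts, labeling_functions):
    """Return (text, label) pairs for every text that received a label."""
    training_set = []
    for text in texts:
        label = majority_label(text, labeling_functions)
        if label != ABSTAIN:
            training_set.append((text, label))
    return training_set


if __name__ == "__main__":
    unlabeled = [
        "I want a refund now!!",
        "Thanks for the quick help",
        "Where is my order?",
    ]
    for text, label in build_training_set(unlabeled, LABELING_FUNCTIONS):
        print(label, text)
```

The resulting pairs could then be fed to any ordinary classifier as training data. Texts on which every heuristic abstains, or on which the heuristics disagree evenly, are simply left out of the training set in this sketch.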