There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.
As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.
Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.
Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for our serverless data analysis.
With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Building a master data set helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.
There are myriad reasons why data should be protected, and just as many ways to enforce it in tranist or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anny Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.
The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph, however databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.
The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.
One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.
With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.