Databases

TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65

Summary

The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time oriented events.

Introduction

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show, please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m welcoming Ajay Kulkarni and Mike Freedman back to talk about how TimescaleDB has grown and changed over the past year

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you refresh our memory about what TimescaleDB is?
  • How has the market for timeseries databases changed since we last spoke?
  • What has changed in the focus and features of the TimescaleDB project and company?
  • Toward the end of 2018 you launched the 1.0 release of Timescale. What were your criteria for establishing that milestone?
    • What were the most challenging aspects of reaching that goal?
  • In terms of timeseries workloads, what are some of the factors that differ across varying use cases?
    • How do those differences impact the ways in which Timescale is used by the end user, and built by your team?
  • What are some of the initial assumptions that you made while first launching Timescale that have held true, and which have been disproven?
  • How have the improvements and new features in the recent releases of PostgreSQL impacted the Timescale product?
    • Have you been able to leverage some of the native improvements to simplify your implementation?
    • Are there any use cases for Timescale that would have been previously impractical in vanilla Postgres that would now be reasonable without the help of Timescale?
  • What is in store for the future of the Timescale product and organization?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62

Summary

Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what PipelineDB is and the motivation for creating it?
    • What are the major use cases that it enables?
    • What are some example applications that are uniquely well suited to the capabilities of PipelineDB?
  • What are the major concepts and components that users of PipelineDB should be familiar with?
  • Given the fact that it is a plugin for PostgreSQL, what level of compatibility exists between PipelineDB and other plugins such as Timescale and Citus?
  • What are some of the common patterns for populating data streams?
  • What are the options for scaling PipelineDB systems, both vertically and horizontally?
    • How much elasticity does the system support in terms of changing volumes of inbound data?
    • What are some of the limitations or edge cases that users should be aware of?
  • Given that inbound data is not persisted to disk, how do you guard against data loss?
    • Is it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location?
    • Can a separate table be used as an input stream?
  • Since the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views?
  • What are some of the features that you have found to be the most useful which users might initially overlook?
  • What would be involved in generating an alert or notification on an aggregate output that was in some way anomalous?
  • What are some of the most challenging aspects of building continuous aggregates on unbounded data?
  • What have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB?
  • What are some of the most interesting or unexpected ways that you have seen PipelineDB used?
  • When is PipelineDB the wrong choice?
  • What do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

Summary

With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Iceberg is and the motivation for creating it?
    • Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?
  • How has the use of Iceberg simplified your work at Netflix?
  • How is the reference implementation architected and how has it evolved since you first began work on it?
    • What is involved in deploying it to a user’s environment?
  • For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?
    • Is there a migration path for pre-existing tables into the Iceberg format?
  • How is schema evolution managed at the file level?
    • How do you handle files on disk that don’t contain all of the fields specified in a table definition?
  • One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard?
  • What are the unique challenges posed by using S3 as the basis for a data lake?
    • What are the benefits that outweigh the difficulties?
  • What have been some of the most challenging or contentious details of the specification to define?
    • What are some things that you have explicitly left out of the specification?
  • What are your long-term goals for the Iceberg specification?
    • Do you anticipate the reference implementation continuing to be used and maintained?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44

Summary

The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph. However, databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • If you have ever wished that you could use the same tools for versioning and distributing your data that you use for your software then you owe it to yourself to check out what the fine folks at Quilt Data have built. Quilt is an open source platform for building a sane workflow around your data that works for your whole team, including version history, metadata management, and flexible hosting. Stop by their booth at JupyterCon in New York City on August 22nd through the 24th to say Hi and tell them that the Data Engineering Podcast sent you! After that, keep an eye on the AWS marketplace for a pre-packaged version of Quilt for Teams to deploy into your own environment and stop fighting with your data.
  • Python has quickly become one of the most widely used languages by both data engineers and data scientists, letting everyone on your team understand each other more easily. However, it can be tough learning it when you’re just starting out. Luckily, there’s an easy way to get involved. Written by MIT lecturer Ana Bell and published by Manning Publications, Get Programming: Learn to code with Python is the perfect way to get started working with Python. Ana’s experience as a teacher of Python really shines through, as you get hands-on with the language without being drowned in confusing jargon or theory. Filled with practical examples and step-by-step lessons to take on, Get Programming is perfect for people who just want to get stuck in with Python. Get your copy of the book with a special 40% discount for Data Engineering Podcast listeners by going to dataengineeringpodcast.com/get-programming and use the discount code PodInit40!
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Manish Jain about DGraph, a low latency, high throughput, native and distributed graph database.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is DGraph and what motivated you to build it?
  • Graph databases and graph algorithms have been part of the computing landscape for decades. What has changed in recent years to allow for the current proliferation of graph oriented storage systems?
    • The graph space has become crowded in recent years. How does DGraph compare to the current set of offerings?
  • What are some of the common uses of graph storage systems?
    • What are some potential uses that are often overlooked?
  • There are a few ways that graph structures and properties can be implemented, including the ability to store data in the edges connecting nodes and the structures that can be contained within the nodes themselves. How is information represented in DGraph and what are the tradeoffs in the approach that you chose?
  • How does the query interface and data storage in DGraph differ from other options?
    • What are your opinions on the graph query languages that have been adopted by other storage systems, such as Gremlin, Cypher, and GSQL?
  • How is DGraph architected and how has that architecture evolved from when it first started?
  • How do you balance the speed and agility of schema on read with the additional application complexity that is required, as opposed to schema on write?
  • In your documentation you contend that DGraph is a viable replacement for RDBMS-oriented primary storage systems. What are the switching costs for someone looking to make that transition?
  • What are the limitations of DGraph in terms of scalability or usability?
  • Where does it fall along the axes of the CAP theorem?
  • For someone who is interested in building on top of DGraph and deploying it to production, what does their workflow and operational overhead look like?
  • What have been the most challenging aspects of building and growing the DGraph project and community?
  • What are some of the most interesting or unexpected uses of DGraph that you are aware of?
  • When is DGraph the wrong choice?
  • What are your plans for the future of DGraph?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42

Summary

One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements has changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and describes the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Jonathan Katz about a high level view of PostgreSQL and the unique capabilities that it offers

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • How did you get involved in the Postgres project?
  • For anyone who hasn’t used it, can you describe what PostgreSQL is?
    • Where did Postgres get started and how has it evolved over the intervening years?
  • What are some of the primary characteristics of Postgres that would lead someone to choose it for a given project?
    • What are some cases where Postgres is the wrong choice?
  • What are some of the common points of confusion for new users of PostgreSQL (particularly if they have prior database experience)?
  • The recent releases of Postgres have had some fairly substantial improvements and new features. How does the community manage to balance stability and reliability against the need to add new capabilities?
  • What are the aspects of Postgres that allow it to remain relevant in the current landscape of rapid evolution at the data layer?
  • Are there any plans to incorporate a distributed transaction layer into the core of the project along the lines of what has been done with Citus or CockroachDB?
  • What is in store for the future of Postgres?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

User Analytics In Depth At Heap with Dan Robinson - Episode 36

Summary

Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is realizing that you haven’t been tracking a key interaction, then having to write custom logic to add that event and wait to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving a brief overview of Heap?
  • One of your differentiating features is the fact that you capture every interaction on web and mobile platforms for your customers. How do you prevent the user experience from suffering as a result of network congestion, while ensuring the reliable delivery of that data?
  • Can you walk through the lifecycle of a single event from source to destination and the infrastructure components that it traverses to get there?
  • Data collected in a user’s browser can often be messy due to various browser plugins, variations in runtime capabilities, etc. How do you ensure the integrity and accuracy of that information?
    • What are some of the difficulties that you have faced in establishing a representation of events that allows for uniform processing and storage?
  • What is your approach for merging and enriching event data with the information that you retrieve from your supported integrations?
    • What challenges does that pose in your processing architecture?
  • What are some of the problems that you have had to deal with to allow for processing and storing such large volumes of data?
    • How has that architecture changed or evolved over the life of the company?
    • What are some changes that you are anticipating in the near future?
  • Can you describe your approach for synchronizing customer data with their individual Redshift instances and the difficulties that entails?
  • What are some of the most interesting challenges that you have faced while building the technical and business aspects of Heap?
  • What changes have been necessary as a result of GDPR?
  • What are your plans for the future of Heap?

Contact Info

  • @danlovesproofs on twitter
  • [email protected]
  • @drob on github
  • heapanalytics.com / @heap on twitter
  • https://heapanalytics.com/blog/category/engineering

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

CockroachDB In Depth with Peter Mattis - Episode 35

Summary

With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in CockroachDB. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Peter Mattis about CockroachDB, the SQL database for global cloud services

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What was the motivation for creating CockroachDB and building a business around it?
  • Can you describe the architecture of CockroachDB and how it supports distributed ACID transactions?
    • What are some of the tradeoffs that are necessary to allow for georeplicated data with distributed transactions?
    • What are some of the problems that you have had to work around in the Raft protocol to provide reliable operation of the clustering mechanism?
  • Go is an unconventional language for building a database. What are the pros and cons of that choice?
  • What are some of the common points of confusion that users of CockroachDB have when operating or interacting with it?
    • What are the edge cases and failure modes that users should be aware of?
  • I know that your SQL syntax is PostgreSQL compatible, so is it possible to use existing ORMs unmodified with CockroachDB?
    • What are some examples of extensions that are specific to CockroachDB?
  • What are some of the most interesting uses of CockroachDB that you have seen?
  • When is CockroachDB the wrong choice?
  • What do you have planned for the future of CockroachDB?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34

Summary

Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports document, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steeman and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Jan Stücke and Jan Steeman about ArangoDB, a multi-model distributed database for graph, document, and key/value storage.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you give a high level description of what ArangoDB is and the motivation for creating it?
    • What is the story behind the name?
  • How is ArangoDB constructed?
    • How does the underlying engine store the data to allow for the different ways of viewing it?
  • What are some of the benefits of multi-model data storage?
    • When does it become problematic?
  • For users who are accustomed to a relational engine, how do they need to adjust their approach to data modeling when working with Arango?
  • How does it compare to OrientDB?
  • What are the options for scaling a running system?
    • What are the limitations in terms of network architecture or data volumes?
  • One of the unique aspects of ArangoDB is the Foxx framework for embedding microservices in the data layer. What benefits does that provide over a three tier architecture?
    • What mechanisms do you have in place to prevent data breaches from security vulnerabilities in the Foxx code?
    • What are some of the most interesting or surprising uses of this functionality that you have seen?
  • What are some of the most challenging technical and business aspects of building and promoting ArangoDB?
  • What do you have planned for the future of ArangoDB?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

Summary

Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Presto is?
    • What are some of the common use cases and deployment patterns for Presto?
  • How does Presto compare to Drill or Impala?
  • What is it about Presto that led you to building a business around it?
  • What are some of the most challenging aspects of running and scaling Presto?
  • For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?
    • How does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with?
  • What are some cases in which Presto is not the right solution?
  • What types of support have you found to be the most commonly requested?
  • What are some of the types of tooling or improvements that you have made to Presto in your distribution?
    • What are some of the notable changes that your team has contributed upstream to Presto?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31

Summary

The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He describes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how graph databases differ from relational engines, where graph algorithms are useful, and how TigerGraph is built to allow for fast and scalable operation.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Your host is Tobias Macey and last week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. In this second part you will hear from Andy Eschbacher of Carto about the challenges of managing geospatial data, as well as Todd Blaschka of TigerGraph about graph databases and how his company has managed to build a fast and scalable platform for graph storage and traversal.

Interview

Andy Eschbacher From Carto

  • What are the challenges associated with storing geospatial data?
  • What are some of the common misconceptions that people have about working with geospatial data?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Todd Blaschka From TigerGraph

  • What are graph databases and how do they differ from relational engines?
  • What are some of the common difficulties that people have when dealing with graph algorithms?
  • How does data modeling for graph databases differ from relational stores?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA