Open Source

ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34

Summary

Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, dey/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steeman and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Jan Stücke and Jan Steeman about ArangoDB, a multi-model distributed database for graph, document, and key/value storage.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you give a high level description of what ArangoDB is and the motivation for creating it?
    • What is the story behind the name?
  • How is ArangoDB constructed?
    • How does the underlying engine store the data to allow for the different ways of viewing it?
  • What are some of the benefits of multi-model data storage?
    • When does it become problematic?
  • For users who are accustomed to a relational engine, how do they need to adjust their approach to data modeling when working with Arango?
  • How does it compare to OrientDB?
  • What are the options for scaling a running system?
    • What are the limitations in terms of network architecture or data volumes?
  • One of the unique aspects of ArangoDB is the Foxx framework for embedding microservices in the data layer. What benefits does that provide over a three tier architecture?
    • What mechanisms do you have in place to prevent data breaches from security vulnerabilities in the Foxx code?
    • What are some of the most interesting or surprising uses of this functionality that you have seen?
  • What are some of the most challenging technical and business aspects of building and promoting ArangoDB?
  • What do you have planned for the future of ArangoDB?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

Summary

Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Presto is?
    • What are some of the common use cases and deployment patterns for Presto?
  • How does Presto compare to Drill or Impala?
  • What is it about Presto that led you to building a business around it?
  • What are some of the most challenging aspects of running and scaling Presto?
  • For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?
    • How does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with?
  • What are some cases in which Presto is not the right solution?
  • What types of support have you found to be the most commonly requested?
  • What are some of the types of tooling or improvements that you have made to Presto in your distribution?
    • What are some of the notable changes that your team has contributed upstream to Presto?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / {CC BY-SA](http://creativecommons.org/licenses/by-sa/3.0/)

Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29

Summary

Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organizations data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used to build and share useful reports, some of the useful features planned for future releases, and how to get it set up to start using it in your environment.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Sameer Al-Sakran about Metabase, a free and open source tool for self service business intelligence

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • The current goal for most companies is to be “data driven”. How would you define that concept?
    • How does Metabase assist in that endeavor?
  • What is the ratio of users that take advantage of the GUI query builder as opposed to writing raw SQL?
    • What level of complexity is possible with the query builder?
  • What have you found to be the typical use cases for Metabase in the context of an organization?
  • How do you manage scaling for large or complex queries?
  • What was the motivation for using Clojure as the language for implementing Metabase?
  • What is involved in adding support for a new data source?
  • What are the differentiating features of Metabase that would lead someone to choose it for their organization?
  • What have been the most challenging aspects of building and growing Metabase, both from a technical and business perspective?
  • What do you have planned for the future of Metabase?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Stretching The Elastic Stack with Philipp Krenn - Episode 23

Summary

Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to many more use cases in the proces. In this episode Philipp Krenn describes the various pieces of the stack, how they fit together, and how you can use them in your infrastructure to store, search, and analyze your data.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Philipp Krenn about the Elastic Stack and the ways that you can use it in your systems

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • The Elasticsearch product has been around for a long time and is widely known, but can you give a brief overview of the other components that make up the Elastic Stack and how they work together?
  • Beyond the common pattern of using Elasticsearch as a search engine connected to a web application, what are some of the other use cases for the various pieces of the stack?
  • What are the common scaling bottlenecks that users should be aware of when they are dealing with large volumes of data?
  • What do you consider to be the biggest competition to the Elastic Stack as you expand the capabilities and target usage patterns?
  • What are the biggest challenges that you are tackling in the Elastic stack, technical or otherwise?
  • What are the biggest challenges facing Elastic as a company in the near to medium term?
  • Open source as a business model: https://www.elastic.co/blog/doubling-down-on-open
  • What is the vision for Elastic and the Elastic Stack going forward and what new features or functionality can we look forward to?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

Summary

As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • Your host is Tobias Macey and today I’m interviewing Ajay Kulkarni and Mike Freedman about Timescale DB, a scalable timeseries database built on top of PostGreSQL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Timescale is and how the project got started?
  • The landscape of time series databases is extensive and oftentimes difficult to navigate. How do you view your position in that market and what makes Timescale stand out from the other options?
  • In your blog post that explains the design decisions for how Timescale is implemented you call out the fact that the inserted data is largely append only which simplifies the index management. How does Timescale handle out of order timestamps, such as from infrequently connected sensors or mobile devices?
  • How is Timescale implemented and how has the internal architecture evolved since you first started working on it?
    • What impact has the 10.0 release of PostGreSQL had on the design of the project?
    • Is timescale compatible with systems such as Amazon RDS or Google Cloud SQL?
  • For someone who wants to start using Timescale what is involved in deploying and maintaining it?
  • What are the axes for scaling Timescale and what are the points where that scalability breaks down?
    • Are you aware of anyone who has deployed it on top of Citus for scaling horizontally across instances?
  • What has been the most challenging aspect of building and marketing Timescale?
  • When is Timescale the wrong tool to use for time series data?
  • One of the use cases that you call out on your website is for systems metrics and monitoring. How does Timescale fit into that ecosystem and can it be used along with tools such as Graphite or Prometheus?
  • What are some of the most interesting uses of Timescale that you have seen?
  • Which came first, Timescale the business or Timescale the database, and what is your strategy for ensuring that the open source project and the company around it both maintain their health?
  • What features or improvements do you have planned for future releases of Timescale?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

Summary

One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • A few announcements:
    • There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
    • The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
    • If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
  • Your host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Pulsar is and what the original inspiration for the project was?
  • What have been some of the most challenging aspects of building and promoting Pulsar?
  • For someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering and what is involved in deploying the various components?
  • What are the scaling factors for Pulsar and what aspects of deployment and administration should users pay special attention to?
  • What projects or services do you consider to be competitors to Pulsar and what makes it stand out in comparison?
  • The documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to also supporting some of the plugins that have developed on top of Kafka?
  • One of the popular aspects of Kafka is the persistence of the message log, so I’m curious how Pulsar manages long-term storage and reprocessing of messages that have already been acknowledged?
  • When is Pulsar the wrong tool to use?
  • What are some of the improvements or new features that you have planned for the future of Pulsar?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Citus Data: Distributed PostGreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13

Summary

PostGreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostGreSQL, and how you can start using it in your environment.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • Your host is Tobias Macey and today I’m interviewing Ozgun Erdogan and Craig Kerstiens about Citus, worry free PostGreSQL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Citus is and how the project got started?
  • Why did you start with Postgres vs. building something from the ground up?
  • What was the reasoning behind converting Citus from a fork of PostGres to being an extension and releasing an open source version?
  • How well does Citus work with other Postgres extensions, such as PostGIS, PipelineDB, or Timescale?
  • How does Citus compare to options such as PostGres-XL or the Postgres compatible Aurora service from Amazon?
  • How does Citus operate under the covers to enable clustering and replication across multiple hosts?
  • What are the failure modes of Citus and how does it handle loss of nodes in the cluster?
  • For someone who is interested in migrating to Citus, what is involved in getting it deployed and moving the data out of an existing system?
  • How do the different options for leveraging Citus compare to each other and how do you determine which features to release or withhold in the open source version?
  • Are there any use cases that Citus enables which would be impractical to attempt in native Postgres?
  • What have been some of the most challenging aspects of building the Citus extension?
  • What are the situations where you would advise against using Citus?
  • What are some of the most interesting or impressive uses of Citus that you have seen?
  • What are some of the features that you have planned for future releases of Citus?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11

Summary

Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • Your host is Tobias Macey and today I’m interviewing Jeroen van der Heijden about SiriDB, a next generation time series database

Interview

  • Introduction
  • How did you get involved in the area of data engineering?
  • What is SiriDB and how did the project get started?
    • What was the inspiration for the name?
  • What was the landscape of time series databases at the time that you first began work on Siri?
  • How does Siri compare to other time series databases such as InfluxDB, Timescale, KairosDB, etc.?
  • What do you view as the competition for Siri?
  • How is the server architected and how has the design evolved over the time that you have been working on it?
  • Can you describe how the clustering mechanism functions?
    • Is it possible to create pools with more than two servers?
  • What are the failure modes for SiriDB and where does it fall on the spectrum for the CAP theorem?
  • In the documentation it mentions needing to specify the retention period for the shards when creating a database. What is the reasoning for that and what happens to the individual metrics as they age beyond that time horizon?
  • One of the common difficulties when using a time series database in an operations context is the need for high cardinality of the metrics. How are metrics identified in Siri and is there any support for tagging?
  • What have been the most challenging aspects of building Siri?
  • In what situations or environments would you advise against using Siri?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Rebuilding Yelp's Data Pipeline with Justin Cunningham - Episode 5

Summary

Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and modern, and then they open sourced it! In this episode Justin Cunningham joins me to discuss the decisions they made and the lessons they learned in the process, including what worked, what didn’t, and what he would do differently if he was starting over today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • Your host is Tobias Macey and today I’m interviewing Justin Cunningham about Yelp’s data pipeline

Interview with Justin Cunningham

  • Introduction
  • How did you get involved in the area of data engineering?
  • Can you start by giving an overview of your pipeline and the type of workload that you are optimizing for?
  • What are some of the dead ends that you experienced while designing and implementing your pipeline?
  • As you were picking the components for your pipeline, how did you prioritize the build vs buy decisions and what are the pieces that you ended up building in-house?
  • What are some of the failure modes that you have experienced in the various parts of your pipeline and how have you engineered around them?
  • What are you using to automate deployment and maintenance of your various components and how do you monitor them for availability and accuracy?
  • While you were re-architecting your monolithic application into a service oriented architecture and defining the flows of data, how were you able to make the switch while verifying that you were not introducing unintended mutations into the data being produced?
  • Did you plan to open-source the work that you were doing from the start, or was that decision made after the project was completed? What were some of the challenges associated with making sure that it was properly structured to be amenable to making it public?
  • What advice would you give to anyone who is starting a brand new project and how would that advice differ for someone who is trying to retrofit a data management architecture onto an existing project?

Keep in touch

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA