Companies

PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

Summary

Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Presto is?
    • What are some of the common use cases and deployment patterns for Presto?
  • How does Presto compare to Drill or Impala?
  • What is it about Presto that led you to building a business around it?
  • What are some of the most challenging aspects of running and scaling Presto?
  • For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?
    • How does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with?
  • What are some cases in which Presto is not the right solution?
  • What types of support have you found to be the most commonly requested?
  • What are some of the types of tooling or improvements that you have made to Presto in your distribution?
    • What are some of the notable changes that your team has contributed upstream to Presto?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / {CC BY-SA](http://creativecommons.org/licenses/by-sa/3.0/)

Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30

Summary

The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and this week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.

Interview

Alan Anders from Applecart

  • What are the challenges of gathering and processing data from multiple data sources and representing them in a unified manner for merging into single entities?
  • What are the biggest technical hurdles at Applecart?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Stepan Pushkarev from Hydrosphere.io

  • What is Hydropshere.io?
  • What metrics do you track to determine when a machine learning model is not producing an appropriate output?
  • How do you determine which data points to sample for retraining the model?
  • How does the role of a machine learning engineer differ from data engineers and data scientists?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29

Summary

Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organizations data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used to build and share useful reports, some of the useful features planned for future releases, and how to get it set up to start using it in your environment.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Sameer Al-Sakran about Metabase, a free and open source tool for self service business intelligence

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • The current goal for most companies is to be “data driven”. How would you define that concept?
    • How does Metabase assist in that endeavor?
  • What is the ratio of users that take advantage of the GUI query builder as opposed to writing raw SQL?
    • What level of complexity is possible with the query builder?
  • What have you found to be the typical use cases for Metabase in the context of an organization?
  • How do you manage scaling for large or complex queries?
  • What was the motivation for using Clojure as the language for implementing Metabase?
  • What is involved in adding support for a new data source?
  • What are the differentiating features of Metabase that would lead someone to choose it for their organization?
  • What have been the most challenging aspects of building and growing Metabase, both from a technical and business perspective?
  • What do you have planned for the future of Metabase?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28

Summary

The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden. In this episode Amnon Drori, CEO and co-founder of Octopai, discusses the business problems he witnessed that led him to starting the company, how their systems are able to provide valuable tools and insights, and the direction that their product will be taking in the future.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Amnon Drori about OctopAI and the benefits of metadata management

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is OctopAI and what was your motivation for founding it?
  • What are some of the types of information that you classify and collect as metadata?
  • Can you talk through the architecture of your platform?
  • What are some of the challenges that are typically faced by metadata management systems?
  • What is involved in deploying your metadata collection agents?
  • Once the metadata has been collected what are some of the ways in which it can be used?
  • What mechanisms do you use to ensure that customer data is segregated?
    • How do you identify and handle sensitive information during the collection step?
  • What are some of the most challenging aspects of your technical and business platforms that you have faced?
  • What are some of the plans that you have for OctopAI going forward?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25

Summary

Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all of the data that your servers generate and monitors for unexpected anomalies in behavior that would indicate a breach and notifies you in near-realtime. In this episode ThreatStack’s director of operations, Pete Cheslock, and senior infrastructure security engineer, Patrick Cable, discuss the data infrastructure that supports their platform, how they capture and process the data from client systems, and how that information can be used to keep your systems safe from attackers.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Pat Cable about the data infrastructure and security controls at ThreatStack

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Why don’t you start by explaining what ThreatStack does?
    • What was lacking in the existing options (services and self-hosted/open source) that ThreatStack solves for?
  • Can you describe the type(s) of data that you collect and how it is structured?
  • What is the high level data infrastructure that you use for ingesting, storing, and analyzing your customer data?
    • How do you ensure a consistent format of the information that you receive?
    • How do you ensure that the various pieces of your platform are deployed using the proper configurations and operating as intended?
    • How much configuration do you provide to the end user in terms of the captured data, such as sampling rate or additional context?
  • I understand that your original architecture used RabbitMQ as your ingest mechanism, which you then migrated to Kafka. What was your initial motivation for that change?
    • How much of a benefit has that been in terms of overall complexity and cost (both time and infrastructure)?
  • How do you ensure the security and provenance of the data that you collect as it traverses your infrastructure?
  • What are some of the most common vulnerabilities that you detect in your client’s infrastructure?
  • For someone who wants to start using ThreatStack, what does the setup process look like?
  • What have you found to be the most challenging aspects of building and managing the data processes in your environment?
  • What are some of the projects that you have planned to improve the capacity or capabilities of your infrastructure?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Stretching The Elastic Stack with Philipp Krenn - Episode 23

Summary

Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to many more use cases in the proces. In this episode Philipp Krenn describes the various pieces of the stack, how they fit together, and how you can use them in your infrastructure to store, search, and analyze your data.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • Your host is Tobias Macey and today I’m interviewing Philipp Krenn about the Elastic Stack and the ways that you can use it in your systems

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • The Elasticsearch product has been around for a long time and is widely known, but can you give a brief overview of the other components that make up the Elastic Stack and how they work together?
  • Beyond the common pattern of using Elasticsearch as a search engine connected to a web application, what are some of the other use cases for the various pieces of the stack?
  • What are the common scaling bottlenecks that users should be aware of when they are dealing with large volumes of data?
  • What do you consider to be the biggest competition to the Elastic Stack as you expand the capabilities and target usage patterns?
  • What are the biggest challenges that you are tackling in the Elastic stack, technical or otherwise?
  • What are the biggest challenges facing Elastic as a company in the near to medium term?
  • Open source as a business model: https://www.elastic.co/blog/doubling-down-on-open
  • What is the vision for Elastic and the Elastic Stack going forward and what new features or functionality can we look forward to?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Honeycomb Data Infrastructure with Sam Stokes - Episode 20

Summary

One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a successful system. Sam Stokes is an engineer at Honeycomb where he helps to build a platform that is able to capture all of the events and context that occur in our production environments and use them to answer all of your questions about what is happening in your system right now. In this episode he discusses the challenges inherent in capturing and analyzing event data, the tools that his team is using to make it possible, and how this type of knowledge can be used to improve your critical infrastructure.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • A few announcements:
    • There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
    • The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
    • If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
  • Your host is Tobias Macey and today I’m interviewing Sam Stokes about his work at Honeycomb, a modern platform for observability of software systems

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is Honeycomb and how did you get started at the company?
  • Can you start by giving an overview of your data infrastructure and the path that an event takes from ingest to graph?
  • What are the characteristics of the event data that you are dealing with and what challenges does it pose in terms of processing it at scale?
  • In addition to the complexities of ingesting and storing data with a high degree of cardinality, being able to quickly analyze it for customer reporting poses a number of difficulties. Can you explain how you have built your systems to facilitate highly interactive usage patterns?
  • A high degree of visibility into a running system is desirable for developers and systems adminstrators, but they are not always willing or able to invest the effort to fully instrument the code or servers that they want to track. What have you found to be the most difficult aspects of data collection, and do you have any tooling to simplify the implementation for user?
  • How does Honeycomb compare to other systems that are available off the shelf or as a service, and when is it not the right tool?
  • What have been some of the most challenging aspects of building, scaling, and marketing Honeycomb?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

Summary

As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • Your host is Tobias Macey and today I’m interviewing Ajay Kulkarni and Mike Freedman about Timescale DB, a scalable timeseries database built on top of PostGreSQL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Timescale is and how the project got started?
  • The landscape of time series databases is extensive and oftentimes difficult to navigate. How do you view your position in that market and what makes Timescale stand out from the other options?
  • In your blog post that explains the design decisions for how Timescale is implemented you call out the fact that the inserted data is largely append only which simplifies the index management. How does Timescale handle out of order timestamps, such as from infrequently connected sensors or mobile devices?
  • How is Timescale implemented and how has the internal architecture evolved since you first started working on it?
    • What impact has the 10.0 release of PostGreSQL had on the design of the project?
    • Is timescale compatible with systems such as Amazon RDS or Google Cloud SQL?
  • For someone who wants to start using Timescale what is involved in deploying and maintaining it?
  • What are the axes for scaling Timescale and what are the points where that scalability breaks down?
    • Are you aware of anyone who has deployed it on top of Citus for scaling horizontally across instances?
  • What has been the most challenging aspect of building and marketing Timescale?
  • When is Timescale the wrong tool to use for time series data?
  • One of the use cases that you call out on your website is for systems metrics and monitoring. How does Timescale fit into that ecosystem and can it be used along with tools such as Graphite or Prometheus?
  • What are some of the most interesting uses of Timescale that you have seen?
  • Which came first, Timescale the business or Timescale the database, and what is your strategy for ensuring that the open source project and the company around it both maintain their health?
  • What features or improvements do you have planned for future releases of Timescale?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Citus Data: Distributed PostGreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13

Summary

PostGreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostGreSQL, and how you can start using it in your environment.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • Your host is Tobias Macey and today I’m interviewing Ozgun Erdogan and Craig Kerstiens about Citus, worry free PostGreSQL

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Citus is and how the project got started?
  • Why did you start with Postgres vs. building something from the ground up?
  • What was the reasoning behind converting Citus from a fork of PostGres to being an extension and releasing an open source version?
  • How well does Citus work with other Postgres extensions, such as PostGIS, PipelineDB, or Timescale?
  • How does Citus compare to options such as PostGres-XL or the Postgres compatible Aurora service from Amazon?
  • How does Citus operate under the covers to enable clustering and replication across multiple hosts?
  • What are the failure modes of Citus and how does it handle loss of nodes in the cluster?
  • For someone who is interested in migrating to Citus, what is involved in getting it deployed and moving the data out of an existing system?
  • How do the different options for leveraging Citus compare to each other and how do you determine which features to release or withhold in the open source version?
  • Are there any use cases that Citus enables which would be impractical to attempt in native Postgres?
  • What have been some of the most challenging aspects of building the Citus extension?
  • What are the situations where you would advise against using Citus?
  • What are some of the most interesting or impressive uses of Citus that you have seen?
  • What are some of the features that you have planned for future releases of Citus?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

data.world with Bryon Jacob - Episode 9

Summary

We have tools and platforms for collaborating on software projects and linking them together, wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and private use that can be linked together to build a semantic web of information. The CTO, Bryon Jacob, discusses how the company got started, their mission, and how they have built and evolved their technical infrastructure.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
  • Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • This is your host Tobias Macey and today I’m interviewing Bryon Jacob about the technology and purpose that drive data.world

Interview

  • Introduction
  • How did you first get involved in the area of data management?
  • What is data.world and what is its mission and how does your status as a B Corporation tie into that?
  • The platform that you have built provides hosting for a large variety of data sizes and types. What does the technical infrastructure consist of and how has that architecture evolved from when you first launched?
  • What are some of the scaling problems that you have had to deal with as the amount and variety of data that you host has increased?
  • What are some of the technical challenges that you have been faced with that are unique to the task of hosting a heterogeneous assortment of data sets that intended for shared use?
  • How do you deal with issues of privacy or compliance associated with data sets that are submitted to the platform?
  • What are some of the improvements or new capabilities that you are planning to implement as part of the data.world platform?
  • What are the projects or companies that you consider to be your competitors?
  • What are some of the most interesting or unexpected uses of the data.world platform that you are aware of?

Contact Information

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA