Workflows

Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50

Summary

There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how the are using public data sources to build a knowledge graph

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?
    • How do you define the concept of a knowledge graph?
  • What are the processes involved in constructing a knowledge graph?
  • Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph?
  • What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?
    • How do you manage the software lifecycle for your ETL code?
    • What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?
  • What are the current challenges that you are facing in building and scaling your data infrastructure?
    • How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose?
    • What techniques are you using to manage accuracy and consistency in the data that you ingest?
  • Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers?
  • What are the weak spots in your platform that you are planning to address in upcoming projects?
    • If you were to start from scratch today, what would you have done differently?
  • What are some of the most interesting or unexpected uses of your product that you have seen?
  • What is in store for the future of Enigma?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Putting Airflow Into Production With James Meickle - Episode 43

Summary

The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What was your initial project requirement?
    • What tooling did you consider in addition to Airflow?
    • What aspects of the Airflow platform led you to choose it as your implementation target?
  • Can you describe your current deployment architecture?
    • How many engineers are involved in writing tasks for your Airflow installation?
  • What resources were the most helpful while learning about Airflow design patterns?
    • How have you architected your DAGs for deployment and extensibility?
  • What kinds of tests and automation have you put in place to support the ongoing stability of your deployment?
  • What are some of the dead-ends or other pitfalls that you encountered during the course of this project?
  • What aspects of Airflow have you found to be lacking that you would like to see improved?
  • What did you wish someone had told you before you started work on your Airflow installation?
    • If you were to start over would you make the same choice?
    • If Airflow wasn’t available what would be your second choice?
  • What are your next steps for improvements and fixes?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Dask with Matthew Rocklin - Episode 2

Summary

There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task oriented workflow tool and an in memory processing framework, and how it brings the power of Python to bear on the problem of big data.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • Your host is Tobias Macey and today I’m interviewing Matthew Rocklin about Dask and the Blaze ecosystem.

Interview with Matthew Rocklin

  • Introduction
  • How did you get involved in the area of data engineering?
  • Dask began its life as part of the Blaze project. Can you start by describing what Dask is and how it originated?
  • There are a vast number of tools in the field of data analytics. What are some of the specific use cases that Dask was built for that weren’t able to be solved by the existing options?
  • One of the compelling features of Dask is the fact that it is a Python library that allows for distributed computation at a scale that has largely been the exclusive domain of tools in the Hadoop ecosystem. Why do you think that the JVM has been the reigning platform in the data analytics space for so long?
  • Do you consider Dask, along with the larger Blaze ecosystem, to be a competitor to the Hadoop ecosystem, either now or in the future?
  • Are you seeing many Hadoop or Spark solutions being migrated to Dask? If so, what are the common reasons?
  • There is a strong focus for using Dask as a tool for interactive exploration of data. How does it compare to something like Apache Drill?
  • For anyone looking to integrate Dask into an existing code base that is already using NumPy or Pandas, what does that process look like?
  • How do the task graph capabilities compare to something like Airflow or Luigi?
  • Looking through the documentation for the graph specification in Dask, it appears that there is the potential to introduce cycles or other bugs into a large or complex task chain. Is there any built-in tooling to check for that before submitting the graph for execution?
  • What are some of the most interesting or unexpected projects that you have seen Dask used for?
  • What do you perceive as being the most relevant aspects of Dask for data engineering/data infrastructure practitioners, as compared to the end users of the systems that they support?
  • What are some of the most significant problems that you have been faced with, and which still need to be overcome in the Dask project?
  • I know that the work on Dask is largely performed under the umbrella of PyData and sponsored by Continuum Analytics. What are your thoughts on the financial landscape for open source data analytics and distributed computation frameworks as compared to the broader world of open source projects?

Keep in touch

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA