Analytics

Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

Summary

Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a modern data platform that can serve the data needs of an entire company

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Looker is and the problem that it is aiming to solve?
    • How do you define business intelligence?
  • How does Looker differ from other approaches to business intelligence in the enterprise?
    • How does it compare to open source platforms for BI?
  • Can you describe the technical infrastructure that supports Looker?
  • Given that you are connecting to the customer’s data store, how do you ensure sufficient security?
  • For someone who is using Looker, what does their workflow look like?
    • How does that change for different user roles (e.g. data engineer vs. sales management)?
  • What are the scaling factors for Looker, both in terms of the volume of data being reported on and the level of user concurrency?
  • What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?
    • What are the portions of the Looker architecture that you would do differently if you were to start over today?
  • What are some of the most interesting or unusual uses of Looker that you have seen?
  • What is in store for the future of Looker?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48

Summary

Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics

Interview

  • Introductions
  • How did you get involved in the area of data engineering and data management?
  • What is Snowplow Analytics and what problem were you trying to solve when you started the company?
  • What is unique about customer event data from an ingestion and processing perspective?
  • Challenges with properly matching up data between sources
  • Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data? (A minimal schema-validation sketch follows this list.)
    • Cleanliness/accuracy
  • What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly?
  • Can you describe the overall architecture of the ingest pipeline that Snowplow provides?
    • How has that architecture evolved from when you first started?
    • What would you do differently if you were to start over today?
  • Ensuring appropriate use of enrichment sources
  • What have been some of the biggest challenges encountered while building and evolving Snowplow?
  • What are some of the most interesting uses of your platform that you are aware of?
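
A recurring theme in this interview is validating event data against a schema before it enters the pipeline, and tracking how many events pass or fail. The snippet below is a minimal sketch of that general pattern using the Python jsonschema library; the event shape and schema are hypothetical illustrations, not Snowplow’s actual self-describing JSON schemas or tracker API.

```python
# Minimal sketch of schema-validating incoming events before ingestion.
# The event structure and schema are hypothetical illustrations of the
# general pattern discussed in the interview, not Snowplow's actual formats.
from jsonschema import Draft7Validator

PAGE_VIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "event_type": {"const": "page_view"},
        "user_id": {"type": "string"},
        "page_url": {"type": "string"},
        "timestamp": {"type": "string"},
    },
    "required": ["event_type", "user_id", "page_url", "timestamp"],
    "additionalProperties": False,
}

validator = Draft7Validator(PAGE_VIEW_SCHEMA)

def ingest(events):
    """Split a batch into valid events and rejects, with simple counters."""
    accepted, rejected = [], []
    for event in events:
        errors = list(validator.iter_errors(event))
        if errors:
            rejected.append({"event": event, "errors": [e.message for e in errors]})
        else:
            accepted.append(event)
    # Counts like these are the kind of ingestion metrics worth monitoring.
    print(f"accepted={len(accepted)} rejected={len(rejected)}")
    return accepted, rejected

if __name__ == "__main__":
    ingest([
        {"event_type": "page_view", "user_id": "u1",
         "page_url": "https://example.com/", "timestamp": "2018-09-01T12:00:00Z"},
        {"event_type": "page_view", "user_id": "u2"},  # missing fields -> rejected
    ])
```

Keeping rejected events (rather than silently dropping them) is what makes it possible to monitor cleanliness and accuracy over time and to reprocess data once a schema issue is fixed.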

Keep In Touch

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41

Summary

With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is Ona and how did the company get started?
    • What are some examples of the types of customers that you work with?
  • What types of data do you support in your collection platform?
  • What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users?
  • Does your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization?
  • What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers?
  • Can you describe the flow of the data from collection through to analysis?
  • To help improve the utility of the data being collected, you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?
    • What are the architectural considerations that you factored in when designing it?
    • What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?
  • What are your plans for the future of Ona and Canopy?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Dask with Matthew Rocklin - Episode 2

Summary

There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task-oriented workflow tool and an in-memory processing framework, and how it brings the power of Python to bear on the problem of big data.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
  • You can help support the show by checking out the Patreon page which is linked from the site.
  • To help other people find the show, you can leave a review on iTunes or Google Play Music and tell your friends and co-workers
  • Your host is Tobias Macey and today I’m interviewing Matthew Rocklin about Dask and the Blaze ecosystem.

Interview with Matthew Rocklin

  • Introduction
  • How did you get involved in the area of data engineering?
  • Dask began its life as part of the Blaze project. Can you start by describing what Dask is and how it originated?
  • There are a vast number of tools in the field of data analytics. What are some of the specific use cases that Dask was built for that couldn’t be addressed by the existing options?
  • One of the compelling features of Dask is the fact that it is a Python library that allows for distributed computation at a scale that has largely been the exclusive domain of tools in the Hadoop ecosystem. Why do you think that the JVM has been the reigning platform in the data analytics space for so long?
  • Do you consider Dask, along with the larger Blaze ecosystem, to be a competitor to the Hadoop ecosystem, either now or in the future?
  • Are you seeing many Hadoop or Spark solutions being migrated to Dask? If so, what are the common reasons?
  • There is a strong focus for using Dask as a tool for interactive exploration of data. How does it compare to something like Apache Drill?
  • For anyone looking to integrate Dask into an existing code base that is already using NumPy or Pandas, what does that process look like? (See the first sketch after this list.)
  • How do the task graph capabilities compare to something like Airflow or Luigi?
  • Looking through the documentation for the graph specification in Dask, it appears that there is the potential to introduce cycles or other bugs into a large or complex task chain. Is there any built-in tooling to check for that before submitting the graph for execution? (See the second sketch after this list.)
  • What are some of the most interesting or unexpected projects that you have seen Dask used for?
  • What do you perceive as being the most relevant aspects of Dask for data engineering/data infrastructure practitioners, as compared to the end users of the systems that they support?
  • What are some of the most significant problems that you have been faced with, and which still need to be overcome in the Dask project?
  • I know that the work on Dask is largely performed under the umbrella of PyData and sponsored by Continuum Analytics. What are your thoughts on the financial landscape for open source data analytics and distributed computation frameworks as compared to the broader world of open source projects?
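
For the question about integrating Dask into an existing NumPy or Pandas code base, here is a minimal sketch using the public dask.dataframe API. The data is invented purely for illustration; the point is that the Dask DataFrame mirrors the Pandas API and defers execution until .compute() is called.

```python
# Minimal sketch: wrapping an existing Pandas workflow in Dask.
# The data here is invented purely for illustration.
import pandas as pd
import dask.dataframe as dd

# An existing in-memory Pandas DataFrame...
pdf = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales": [100, 250, 175, 300],
})

# ...becomes a partitioned Dask DataFrame with a nearly identical API.
ddf = dd.from_pandas(pdf, npartitions=2)

# Operations build a lazy task graph; nothing runs until .compute().
totals = ddf.groupby("region")["sales"].sum()
print(totals.compute())
```

In real migrations the data would be loaded directly into partitions (for example with dd.read_csv) rather than passing through a single in-memory Pandas frame, but the lazy, Pandas-like workflow is the same.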
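
On the question about the raw graph specification and the possibility of cycles: a Dask graph is a plain Python dictionary mapping keys to either literal values or task tuples of the form (callable, arg1, arg2, ...). The sketch below shows that dictionary form and a simple, purely illustrative user-side cycle check; the check is an assumption for illustration and is not part of Dask’s own API.

```python
# Minimal sketch of Dask's dictionary graph specification, plus a simple
# user-side cycle check. The cycle check is illustrative only; it is not
# part of Dask's own API.
from operator import add, mul
from dask.threaded import get

# Keys map to either literal values or task tuples: (callable, *arguments).
dsk = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", 10),    # w = z * 10
}

def find_dependencies(task):
    """Collect graph keys referenced by a task tuple (very simplified)."""
    if isinstance(task, tuple):
        return [arg for arg in task[1:] if isinstance(arg, str) and arg in dsk]
    return []

def has_cycle(graph):
    """Depth-first search for cycles over key-to-key dependencies."""
    visiting, finished = set(), set()

    def visit(key):
        if key in finished:
            return False
        if key in visiting:
            return True  # back edge found -> cycle
        visiting.add(key)
        for dep in find_dependencies(graph.get(key)):
            if visit(dep):
                return True
        visiting.discard(key)
        finished.add(key)
        return False

    return any(visit(key) for key in graph)

assert not has_cycle(dsk)
print(get(dsk, "w"))  # executes the graph with the threaded scheduler -> 30
```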

Keep in touch

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA