Scale Your Analytics On The Clickhouse Data Warehouse - Episode 88

Summary

The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at linode.com/dataengineeringpodcast or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Integrating data across the enterprise has been around for decades, and so have the techniques to do it. But a new way of integrating data and improving streams has evolved. By integrating each silo independently, data can be integrated without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more about how to deliver fast access to your data across the enterprise using this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Integrating data across the enterprise has been around for decades, and so have the techniques to do it. But a new way of integrating data and improving streams has evolved. By integrating each silo independently, data can be integrated without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more about how to deliver fast access to your data across the enterprise using this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Robert Hodges and Alexander Zaitsev about Clickhouse, an open source, column-oriented database for fast and scalable OLAP queries

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Clickhouse is and how you each got involved with it?
    • What are the primary use cases that Clickhouse is targeting?
    • Where does it fit in the database market and how does it compare to other column stores, both open source and commercial?
  • Can you describe how Clickhouse is architected?
  • Can you talk through the lifecycle of a given record or set of records from when they first get inserted into Clickhouse, through the engine and storage layer, and then the lookup process at query time?
    • I noticed that Clickhouse has a feature for implementing data safeguards (deletion protection, etc.). Can you talk through how that factors into different use cases for Clickhouse?
  • Aside from directly inserting a record via the client APIs can you talk through the options for loading data into Clickhouse?
    • For the MySQL/Postgres replication functionality how do you maintain schema evolution from the source DB to Clickhouse?
  • What are some of the advanced capabilities, such as SQL extensions, supported data types, etc. that are unique to Clickhouse?
  • For someone getting started with Clickhouse can you describe how they should be thinking about data modeling?
  • Recent entrants to the data warehouse market are encouraging users to insert raw, unprocessed records and then do their transformations with the database engine, as opposed to using a data lake as the staging ground for transformations prior to loading into the warehouse. Where does Clickhouse fall along that spectrum?
  • How is scaling in Clickhouse implemented and what are the edge cases that users should be aware of?
    • How is data replication and consistency managed?
  • What is involved in deploying and maintaining an installation of Clickhouse?
    • I noticed that Altinity is providing a Kubernetes operator for Clickhouse. What are the opportunities and tradeoffs presented by that platform for Clickhouse?
  • What are some of the most interesting/unexpected/innovative ways that you have seen Clickhouse used?
  • What are some of the most challenging aspects of working on Clickhouse itself, and/or implementing systems on top of it?
  • What are the shortcomings of Clickhouse and how do you address them at Altinity?
  • When is Clickhouse the wrong choice?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
