Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.
Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at promo.linode.com/dataengineeringpodcast or use the code dataengineering2018 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.
- Hello and welcome to the Data Engineering Podcast, the show about modern data management.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems
- How did you get involved in the area of data management?
- Can you start by explaining what Zookeeper is and how the project got started?
- What are the main motivations for using a centralized coordination service for distributed systems?
- What are the distributed systems primitives that are built into Zookeeper?
- What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of it?
- What are some of the system-level features that application developers will need but that Zookeeper doesn’t provide?
- Can you discuss how Zookeeper is architected and how that design has evolved over time?
- What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?
- What are the scaling factors for Zookeeper?
- What are the edge cases that users should be aware of?
- Where does it fall on the axes of the CAP theorem?
- What are the main failure modes for Zookeeper?
- How much of the recovery logic is left up to the end user of the Zookeeper cluster?
- Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?
- In recent years we have seen projects such as EtcD, which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects?
- What are some of the cases where Zookeeper is the wrong choice?
- How have the needs of distributed systems engineers changed since you first began working on Zookeeper?
- If you were to start the project over today, what would you do differently?
- Would you still use Java?
- What are some of the most interesting or unexpected ways that you have seen Zookeeper used?
- What do you have planned for the future of Zookeeper?
- @phunt on Twitter
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Google Chubby
- High Availability
- Fallacies of distributed computing
- Falsehoods programmers believe about networking
- Apache Curator
- Raft Consensus Algorithm
- Zookeeper Atomic Broadcast
- SSD Write Cliff
- Apache Kafka
- Apache Flink
- Protocol Buffers