CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and an HTTP interface it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt, which is being addressed with a refactored architecture based on FoundationDB. In this episode Adam Kocoloski shares the history of the project, how it works under the hood, and how the new design will improve the project for our new era of computation. This was an interesting conversation about the challenges of maintaining a large and mission-critical project and the work being done to evolve it.
Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
Many data engineers say the most frustrating part of their job is spending too much time maintaining and monitoring their data pipeline. Snowplow works with data-informed businesses to set up a real-time event data pipeline, taking care of installation, upgrades, autoscaling, and ongoing maintenance so you can focus on the data.
Snowplow runs in your own cloud account, giving you complete control and flexibility over how your data is collected and processed. Best of all, Snowplow is built on top of open source technology, which means you have visibility into every stage of your pipeline, with zero vendor lock-in.
At Snowplow, we know how important it is for data engineers to deliver high-quality data across the organization. That’s why the Snowplow pipeline is designed to deliver complete, rich and accurate data into your data warehouse of choice. Your data analysts define the data structure that works best for your teams, and we enforce it end-to-end so your data is ready to use.
Get in touch with our team to find out how Snowplow can accelerate your analytics. Go to dataengineeringpodcast.com/snowplow. Set up a demo and mention you’re a listener for a special offer!
Enabling real-time analytics is a huge task. Without a data warehouse that outperforms the demands of your customers at a fraction of the cost and time, this big task can also prove challenging. But it doesn’t have to be tiring or difficult with ClickHouse, an open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable revenue. And Altinity is the leading ClickHouse software and service provider on a mission to help data engineers and DevOps managers. Go to dataengineeringpodcast.com/altinity to find out how with a free consultation.
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
- Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Adam Kocoloski about CouchDB and the work being done to migrate the storage layer to FoundationDB
- How did you get involved in the area of data management?
- Can you start by describing what CouchDB is?
- How did you get involved in the CouchDB project and what is your current role in the community?
- What are the use cases that it is well suited for?
- Can you share some of the history of CouchDB and its role in the NoSQL movement?
- How is CouchDB currently architected and how has it evolved since it was first introduced?
- What have been the benefits and challenges of Erlang as the runtime for CouchDB?
- How is the current storage engine implemented and what are its shortcomings?
- What problems are you trying to solve by replatforming on a new storage layer?
- What were the selection criteria for the new storage engine and how did you structure the decision making process?
- What was the motivation for choosing FoundationDB as opposed to other options such as RocksDB, LevelDB, etc.?
- How is the adoption of FoundationDB going to impact the overall architecture and implementation of CouchDB?
- How will the use of FoundationDB impact the way that the current capabilities are implemented, such as data replication?
- What will the migration path be for people running an existing installation?
- What are some of the biggest challenges that you are facing in rearchitecting the codebase?
- What new capabilities will the FoundationDB storage layer enable?
- What are some of the most interesting/unexpected/innovative ways that you have seen CouchDB used?
- What new capabilities or use cases do you anticipate once this migration is complete?
- What are some of the most interesting/unexpected/challenging lessons that you have learned while working with the CouchDB project and community?
- What is in store for the future of CouchDB?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Apache CouchDB
- Experimental Particle Physics
- FPGA == Field Programmable Gate Array
- Apache Software Foundation
- CRDT == Conflict-free Replicated Data Type
- Property Based Testing