{"version":"https://jsonfeed.org/version/1","title":"Data Engineering Podcast","home_page_url":"https://www.dataengineeringpodcast.com","feed_url":"https://www.dataengineeringpodcast.com/json","description":"This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.","_fireside":{"subtitle":"Weekly deep dives on data management with the engineers and entrepreneurs who are shaping the industry","pubdate":"2024-04-21T17:00:00.000-04:00","explicit":false,"copyright":"2024 by Boundless Notions, LLC.","owner":"Tobias Macey","image":"https://assets.fireside.fm/file/fireside-images/podcasts/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/cover.jpg?v=1"},"items":[{"id":"2335d9ff-5fda-498e-a649-355de6c98444","title":"Making Email Better With AI At Shortwave","url":"https://www.dataengineeringpodcast.com/shortwave-ai-powered-email-episode-422","content_text":"Summary\n\nGenerative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave he was focused on making email more productive. When AI started gaining adoption he realized that he had even more potential for a transformative experience. In this episode he shares the technical challenges that he and his team have overcome in integrating AI into their product, as well as the benefits and features that it provides to their customers.\n\nAnnouncements\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nDagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!\nData lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.\nYour host is Tobias Macey and today I'm interviewing Andrew Lee about his work on Shortwave, an AI powered email client\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Shortwave is and the story behind it?\n\n\nWhat is the core problem that you are addressing with Shortwave?\n\nEmail has been a central part of communication and business productivity for decades now. 
What are the overall themes that continue to be problematic?\nWhat are the strengths that email maintains as a protocol and ecosystem?\nFrom a product perspective, what are the data challenges that are posed by email?\nCan you describe how you have architected the Shortwave platform?\n\n\nHow have the design and goals of the product changed since you started it?\nWhat are the ways that the advent and evolution of language models have influenced your product roadmap?\n\nHow do you manage the personalization of the AI functionality in your system for each user/team?\nFor users and teams who are using Shortwave, how does it change their workflow and communication patterns?\nCan you describe how I would use Shortwave for managing the workflow of evaluating, planning, and promoting my podcast episodes?\nWhat are the most interesting, innovative, or unexpected ways that you have seen Shortwave used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Shortwave?\nWhen is Shortwave the wrong choice?\nWhat do you have planned for the future of Shortwave?\n\n\nContact Info\n\n\nLinkedIn\nBlog\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nClosing Announcements\n\n\nThank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\n\n\nLinks\n\n\nShortwave\nFirebase\nGoogle Inbox\nHey\n\n\nEzra Klein Hey Article\n\nSuperhuman\nPinecone\n\n\nPodcast Episode\n\nElastic\nHybrid Search\nSemantic Search\nMistral\nGPT 3.5\nIMAP\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASponsored By:Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)\r\n\r\nThis episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. \r\n\r\nTrusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. 
<u>[dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)</u>Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png)\r\n\r\nData teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. \r\n\r\nDagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to <u>[dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast)</u> today to get your first 30 days free!","content_html":"
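The episode's links point at hybrid search, which blends keyword relevance with embedding similarity. As a rough illustration of the idea only (not Shortwave's implementation; the corpus, toy lexical score, and blending weight are invented for the example), a minimal sketch in Python:

```python
import numpy as np

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms present in the document (a stand-in for BM25)."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_rank(query, docs, q_vec, d_vecs, alpha=0.5):
    """Blend lexical and semantic scores; alpha weights the keyword side."""
    scores = [
        alpha * keyword_score(query, doc) + (1 - alpha) * cosine(q_vec, d_vec)
        for doc, d_vec in zip(docs, d_vecs)
    ]
    return sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)

# Toy data: in a real system the vectors come from an embedding model and the
# lexical score from a search engine such as Elastic.
rng = np.random.default_rng(seed=42)
docs = ["quarterly revenue report", "dinner plans friday", "revenue forecast model"]
d_vecs = [rng.normal(size=8) for _ in docs]
q_vec = rng.normal(size=8)
print(hybrid_rank("revenue report", docs, q_vec, d_vecs))
```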
Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database.
Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.
Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.
A core differentiator of Dagster in the ecosystem of data orchestration is its focus on software-defined assets as a means of building declarative workflows. With the launch of Dagster+ as the redesigned commercial companion to the open source project, they are investing in that capability with a suite of new features. In this episode Pete Hunt, CEO of Dagster Labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units.
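For readers unfamiliar with the software-defined asset model, here is a minimal sketch using Dagster's open source Python API; the asset names and the toy transformation are invented for illustration:

```python
from dagster import Definitions, asset

@asset
def raw_orders() -> list[dict]:
    # In a real pipeline this would pull from an API or database.
    return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 17.5}]

@asset
def order_totals(raw_orders: list[dict]) -> float:
    # Dagster infers the dependency on raw_orders from the parameter name.
    return sum(order["amount"] for order in raw_orders)

# Registering the assets gives Dagster a declarative graph that it can
# materialize, schedule, and track lineage for.
defs = Definitions(assets=[raw_orders, order_totals])
```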
A significant portion of data workflows involve storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data.
Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.
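As a sketch of what those Git-style semantics look like in practice, the Nessie project ships Spark SQL extensions with branch and merge commands. The snippet below assumes a Spark session configured with a Nessie catalog named `nessie` and the Nessie SQL extensions on the classpath; the branch and table names are invented:

```python
from pyspark.sql import SparkSession

# Illustrative only: assumes the Nessie Spark SQL extensions and a Nessie
# catalog named `nessie` are already configured for this session.
spark = SparkSession.builder.appName("nessie-demo").getOrCreate()

spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main")
spark.sql("USE REFERENCE etl IN nessie")

# Writes on the 'etl' branch are invisible to consumers reading 'main'...
spark.sql("INSERT INTO nessie.sales.orders SELECT * FROM staging_orders")

# ...until the branch is validated and merged, just like a Git feature branch.
spark.sql("MERGE BRANCH etl INTO main IN nessie")
```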
Artificial intelligence technologies promise to revolutionize business and produce new sources of value. In order to make those promises a reality there is a substantial amount of strategy and investment required. Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization.
Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.
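To give a feel for those building blocks, here is a small pyarrow example that writes a Parquet file and runs columnar aggregations over it. The data and file name are invented, and InfluxDB's actual engine composes these pieces in Rust; this just shows the Arrow/Parquet layer from Python:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Write a tiny time-series batch to Parquet using Arrow's columnar layout.
table = pa.table({
    "sensor": ["a", "a", "b", "b"],
    "temperature": [20.1, 20.4, 19.8, 19.9],
})
pq.write_table(table, "readings.parquet")

# Read it back and aggregate without ever materializing individual rows.
readings = pq.read_table("readings.parquet")
print(pc.mean(readings["temperature"]))  # column-wise aggregate
print(readings.group_by("sensor").aggregate([("temperature", "mean")]))
```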
A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.
Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.
Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.
Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).
Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve these problems.
Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.
Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.
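One of the first signals to check when a Kafka deployment misbehaves is consumer lag. As a hedged illustration using the kafka-python client (the broker address, topic, and consumer group are invented; verify the client API against your installed version), a small lag report might look like this:

```python
from kafka import KafkaConsumer, TopicPartition

# Illustrative names; point these at your own cluster and group.
consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    enable_auto_commit=False,
)

partitions = [
    TopicPartition("orders", p)
    for p in consumer.partitions_for_topic("orders")
]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0
    lag = end_offsets[tp] - committed  # how far the group trails the log head
    print(f"{tp.topic}[{tp.partition}] lag={lag}")
```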
The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.
If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.
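Anomstack ships its own modeling, but the underlying idea can be shown with a generic rolling z-score check. This is a toy illustration, not Anomstack's implementation; the window, threshold, and metric series are invented:

```python
import statistics

def zscore_anomalies(values, window=7, threshold=3.0):
    """Flag points more than `threshold` standard deviations away from the
    mean of the preceding `window` observations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history) or 1e-9  # avoid divide-by-zero
        z = (values[i] - mean) / stdev
        if abs(z) > threshold:
            anomalies.append((i, values[i], round(z, 2)))
    return anomalies

daily_signups = [100, 104, 98, 101, 99, 103, 97, 102, 240, 100]
print(zscore_anomalies(daily_signups))  # flags the spike at index 8
```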
The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his career to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.
Speaker: Andrei Tserakhau, DoubleCloud Tech Lead. He has over 10 years of IT engineering experience, and for the last 4 years he has been working on distributed systems with a focus on data delivery systems.
Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.
The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.
Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product and the ways that it enhances the ability of humans to get their work done, and when the humans have to adapt to the tool.
Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solutions of many performance bottlenecks and the work that he is doing to shine some light on PostgreSQL to make it easier to understand how to keep it running smoothly.
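A common first step in that kind of investigation is the pg_stat_statements extension, which ranks queries by cumulative execution time. A minimal sketch with psycopg2 follows; the connection string is invented, the extension must be enabled, and the column names assume PostgreSQL 13 or newer:

```python
import psycopg2

# Illustrative DSN; requires the pg_stat_statements extension to be enabled.
conn = psycopg2.connect("dbname=app user=postgres host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT query, calls, total_exec_time, mean_exec_time
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 5
        """
    )
    for query, calls, total_ms, mean_ms in cur.fetchall():
        print(f"{total_ms:10.1f} ms total  {mean_ms:8.2f} ms avg  "
              f"{calls:6d} calls  {query[:60]}")
```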
Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.
The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.
Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.
The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.
Artificial intelligence applications require substantial volumes of high-quality data, which are provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models, it is being used to build new ETL workflows. In this episode Jay Mishra shares his experiences and insights building ETL pipelines with the help of generative AI.
The rapid growth of machine learning, especially large language models, has led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.
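As a quick illustration of what a vector search engine does at its core, here is a brute-force nearest-neighbor query in numpy. Production systems replace the linear scan with approximate indexes such as HNSW, and the vectors here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(7)
corpus = rng.normal(size=(10_000, 128)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit vectors

def top_k(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact cosine-similarity search: one dot product per stored vector."""
    query = query / np.linalg.norm(query)
    scores = corpus @ query
    return np.argsort(scores)[::-1][:k]  # indices of the k best matches

print(top_k(rng.normal(size=128).astype(np.float32)))
```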
A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.
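For readers who have not seen it, a JSON-LD document embeds its semantic context alongside the data itself. The record below is an invented example using the public schema.org vocabulary:

```python
import json

# A JSON-LD record: @context maps plain keys onto a shared vocabulary, so the
# metadata travels with the data instead of living in a separate silo.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "@id": "https://example.com/products/42",
    "name": "Widget",
    "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
    },
}
print(json.dumps(product, indent=2))
```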
Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment.
Cloud data warehouses and the introduction of the ELT paradigm have led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.
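A minimal pipeline with dlt looks roughly like the following; the destination, dataset name, and toy records are invented for the example, and the current options should be checked against the dlt documentation:

```python
import dlt

# dlt runs inside your own process: define a pipeline, hand it an iterable,
# and it handles schema inference, normalization, and loading.
pipeline = dlt.pipeline(
    pipeline_name="demo",
    destination="duckdb",      # any supported destination works here
    dataset_name="demo_data",
)

rows = [{"id": 1, "status": "open"}, {"id": 2, "status": "closed"}]
load_info = pipeline.run(rows, table_name="tickets")
print(load_info)
```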
Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.
Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data driven experience for learners.
Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.
As businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering that in your company.
All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.
Real-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. Despite that, it is still a complex set of capabilities. To bring streaming data in reach of application engineers Matteo Pelati helped to create Dozer. In this episode he explains how investing in high performance and operationally simplified streaming with a familiar API can yield significant benefits for software and data teams together.
Data has been one of the most substantial drivers of business and economic value for the past few decades. Bob Muglia has had a front-row seat to many of the major shifts driven by technology over his career. In his recent book "Datapreneurs" he reflects on the people and businesses that he has known and worked with and how they relied on data to deliver valuable services and drive meaningful change.
For business analytics the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.
Can you describe what entity-centric modeling (ECM) is and the story behind it?
What impact does this have on ML teams? (e.g. feature engineering)
What role does the tooling of a team have in the ways that they end up thinking about modeling? (e.g. dbt vs. Informatica vs. ETL scripts, etc.)
What are some examples of data sources or problem domains for which this approach is well suited?
What are the ways that the benefits of ECM manifest in use cases that are downstream from the warehouse?
What are some concrete tactical steps that teams should be thinking about to implement a workable domain model using entity-centric principles?
What are the most interesting, innovative, or unexpected ways that you have seen ECM used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on ECM?
When is ECM the wrong choice?
What are your predictions for the future direction/adoption of ECM or other modeling techniques?
Feature engineering is a crucial aspect of the machine learning workflow. To make that possible, there are a number of technical and procedural capabilities that must be in place first. In this episode Razi Raziuddin shares how data engineering teams can support the machine learning workflow through the development and support of systems that empower data scientists and ML engineers to build and maintain their own features.
Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.
Architectural decisions are all based on certain constraints and a desire to optimize for different outcomes. In data systems one of the core architectural exercises is data modeling, which can have significant impacts on what is and is not possible for downstream use cases. Incorporating column-level lineage into the data modeling process encourages a more robust and well-informed design. In this episode Satish Jayanthi explores the benefits of incorporating column-aware tooling in the data modeling process.
Data engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and wrong, but then it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt projects to add early verification that the changes you are making are correct. In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff.
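The workflow described here pairs dbt's transformations with data-diff's value-level comparison of tables. As a rough sketch of data-diff's documented Python entry points (the connection strings, table names, and key column are invented, and the exact API may differ by version, so treat this as an assumption to verify):

```python
from data_diff import connect_to_table, diff_tables

# Illustrative connection strings and names; the third argument is the key
# column used to align rows between the two tables.
prod = connect_to_table("postgresql://user:pass@prod-db/app", "orders", "order_id")
dev = connect_to_table("postgresql://user:pass@dev-db/app", "orders", "order_id")

# Yields ('+', row) / ('-', row) tuples for rows that differ between the two.
for sign, row in diff_tables(prod, dev):
    print(sign, row)
```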
A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.
Agile Data Engine is a DataOps management platform for designing, deploying, operating, and managing data products, and for managing the whole lifecycle of a data warehouse. It combines data modeling, transformations, continuous delivery, and workload orchestration into the same platform.
Building a data team is hard in any circumstance, but at a startup it can be even more challenging. The requirements are fluid, you probably don't have a lot of existing data talent to manage the hiring and onboarding, and there is a need to move fast. Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth.
Batch vs. streaming is a long running debate in the world of data integration and transformation. Proponents of the streaming paradigm argue that stream processing engines can easily handle batched workloads, but the reverse isn't true. The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In order to remove that barrier, the team at Estuary have built the Gazette and Flow systems from the ground up to resolve the pain points of other streaming engines, while providing an intuitive interface for data and application engineers to build their streaming workflows. In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache.
All of the advancements in our technology are based on principles of abstraction. These are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked, and some observations on how to deal with that situation in a data platform architecture.
Every business has customers, and a critical element of success is understanding who they are and how they are using the company's products or services. The challenge is that most companies have a multitude of systems that contain fragments of the customer's interactions, and stitching them together is complex and time consuming. Segment created the Unify product to reduce the burden of building a comprehensive view of customers and synchronizing it to all of the systems that need it. In this episode Kevin Niparko and Hanhan Wang share the details of how it is implemented and how you can use it to build and maintain rich customer profiles.
Real-time capabilities have quickly become an expectation for consumers. The complexity of providing those capabilities is still high, however, making it more difficult for small teams to compete. Meroxa was created to enable teams of all sizes to deliver real-time data applications. In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows.
Business intelligence has been chasing the promise of self-serve data for decades. As the capabilities of these systems have improved and become more accessible, the target of what self-serve means has shifted. With the availability of AI powered by large language models, combined with the evolution of semantic layers, the team at Zenlytic have taken aim at this problem again. In this episode Paul Blankley and Ryan Janssen explore the power of natural language driven data exploration combined with semantic modeling that enables an intuitive way for everyone in the business to access the data that they need to succeed in their work.
The customer data platform is a category of services that was developed early in the evolution of the current era of cloud services for data processing. When it was difficult to wire together the event collection, data modeling, reporting, and activation it made sense to buy monolithic products that handled every stage of the customer data lifecycle. Now that the data warehouse has taken center stage a new approach of composable customer data platforms is emerging. In this episode Darren Haken is joined by Tejas Manohar to discuss how Autotrader UK is addressing their customer data needs by building on top of their existing data stack.
The data ecosystem has been building momentum for several years now. As a venture capital investor Matt Turck has been trying to keep track of the main trends and has compiled his findings into the MAD (ML, AI, and Data) landscape reports each year. In this episode he shares his experiences building those reports and the perspective he has gained from the exercise.
The promise of streaming data is that it allows you to react to new information as it happens, rather than introducing latency by batching records together. The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be. After experiencing this unfortunate reality for themselves, Abhishek Chauhan and Ashish Kumar founded Grainite so that you don't have to suffer the same pain. In this episode they explain why streaming architectures are so challenging, how they have designed Grainite to be robust and scalable, and how you can start using it today to build your streaming data applications without all of the operational headache.
What are some of the most complex aspects of building streaming data applications in the absence of something like Grainite?
What are some of the commonalities that you see in the teams/organizations that find their way to Grainite?
What are some of the higher-order projects that teams are able to build when they are using Grainite as a starting point vs. where they would be spending effort on a fully managed streaming architecture?
Can you describe how Grainite is architected?
What does your internal build vs. buy process look like for identifying where to spend your engineering resources?
What is the process for getting Grainite set up and integrated into an organization's technical environment?
Once Grainite is running, can you describe the day 0 workflow of building an application or data flow?
What are the most interesting, innovative, or unexpected ways that you have seen Grainite used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Grainite?
When is Grainite the wrong choice?
What do you have planned for the future of Grainite?
As with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems.
With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow.
The ecosystem for data professionals has matured to the point that there are a large and growing number of distinct roles. With the scope and importance of data steadily increasing it is important for organizations to ensure that everyone is aligned and operating in a positive environment. To help facilitate the nascent conversation about what constitutes an effective and productive data culture, the team at Data Council have dedicated an entire conference track to the subject. In this episode Pete Soderling and Maggie Hays join the show to explore this topic and their experience preparing for the upcoming conference.
There has been a lot of discussion about the practical application of data mesh and how to implement it in an organization. Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. In this episode he shares that journey and the combination of technical and organizational challenges that he encountered in the process.
Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data they constrain the possibilities of what data you can store and how it can be used. Projects like Apache Iceberg provide a viable alternative in the form of data lakehouses that provide the scalability and flexibility of data lakes, combined with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business Tabular to make it even easier to implement and maintain.
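To make the lakehouse idea concrete, here is a small read sketch using the pyiceberg client. The catalog name, namespace, table, and filter are invented, and it assumes a catalog is already configured in pyiceberg's settings:

```python
from pyiceberg.catalog import load_catalog

# Loads catalog settings from pyiceberg's configuration; the name is illustrative.
catalog = load_catalog("prod")
table = catalog.load_table("analytics.page_views")

# Iceberg tracks file-level metadata, so this scan can prune data files using
# the filter before anything is read from object storage.
arrow_table = table.scan(row_filter="event_date >= '2024-01-01'").to_arrow()
print(arrow_table.num_rows)
```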
Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.
This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.
Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that are used by the business to understand and direct the business, but the process is very labor and time intensive. The team at Omni have taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence.
Business intelligence has gone through many evolutions. What are the unique capabilities that Omni Analytics offers over other players in the market?
What are the elements that contribute to BI being such a difficult product to use effectively in an organization?
Can you describe how you have implemented the Omni platform?
What does the workflow for a team using Omni look like?
What are some of the developments in the broader ecosystem that have made your work possible?
What are some of the positive and negative inspirations that you have drawn from the experience that you and your team-mates have gained in previous businesses?
What are the most interesting, innovative, or unexpected ways that you have seen Omni used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Omni?
When is Omni the wrong choice?
What do you have planned for the future of Omni?
The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.
The modern data stack has made it more economical to use enterprise grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.
Managing end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grows. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transformations in a unified SQL interface.
Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increases the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.
With all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term improvements in your productivity that it provides.
Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch.
How did you get involved in the area of data management?
Data platform building journey
General build vs. buy and vendor selection process
Guest call out
Tobias' advice and learnings from building out a data platform:
Tobias' data platform components: data lakehouse paradigm, Airbyte for data integration (chosen over Meltano), Trino/Starburst Galaxy for distributed querying, AWS S3 for the storage layer, AWS Glue for very basic metadata cataloguing, Dagster as the crucial orchestration layer, dbt
Encryption and security are critical elements in data analytics and machine learning applications. We have well developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems.
The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.
One of the reasons that data work is so challenging is because no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing these constraints to your data workflows.
This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.

Data is one of the core ingredients for machine learning, but the format in which it is understandable to humans is not a useful representation for models. Embedding vectors are a way to structure data in a way that is native to how models interpret and manipulate information. In this episode Frank Liu shares how the Towhee library simplifies the work of translating your unstructured data assets (e.g. images, audio, video, etc.) into embeddings that you can use efficiently for machine learning, and how it fits into your workflow for model development.
\nThe intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
\nOne of the most critical aspects of software projects is managing their data. Managing the operational concerns for your database can be complex and expensive, especially if you need to scale to large volumes of data, high traffic, or geographically distributed usage. Planetscale is a serverless option for your MySQL workloads that lets you focus on your applications without having to worry about managing the database or fight with differences between development and production. In this episode Nick van Wiggeren explains how the Planetscale platform is implemented, their strategies for balancing maintenance and improvements of the underlying Vitess project with their business goals, and how you can start using it today to free up the time you spend on database administration.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBusiness intelligence is the foremost application of data in organizations of all sizes. The typical conception of how it is accessed is through a web or desktop application running on a powerful laptop. Zing Data is building a mobile native platform for business intelligence. This opens the door for busy employees to access and analyze their company information away from their desk, but it has the more powerful effect of bringing first-class support to companies operating in mobile-first economies. In this episode Sabin Thomas shares his experiences building the platform and the interesting ways that it is being used.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe term "real-time data" brings with it a combination of excitement, uncertainty, and skepticism. The promise of insights that are always accurate and up to date is appealing to organizations, but the technical realities to make it possible have been complex and expensive. In this episode Arjun Narayan explains how the technical barriers to adopting real-time data in your analytics and applications have become surmountable by organizations of all sizes.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe data ecosystem has been growing rapidly, with new communities joining and bringing their preferred programming languages to the mix. This has led to inefficiencies in how data is stored, accessed, and shared across process and system boundaries. The Arrow project is designed to eliminate wasted effort in translating between languages, and Voltron Data was created to help grow and support its technology and community. In this episode Wes McKinney shares the ways that Arrow and its related projects are improving the efficiency of data systems and driving their next stage of evolution.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
The most expensive part of working with massive data sets is the work of retrieving and processing the files that contain the raw information. FeatureBase (formerly Pilosa) avoids that overhead by converting the data into bitmaps. In this episode Matt Jaffee explains how to model your data as bitmaps and the benefits that this representation provides for fast aggregate computation. He also discusses the improvements that have been incorporated into FeatureBase to simplify integration with the rest of your data stack, and the SQL interface that was added to make working with the product easier.
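\nFor a sense of why the bitmap representation makes aggregates cheap, here is a toy bitmap index in Python (3.10+ for int.bit_count); it illustrates the general technique rather than FeatureBase's actual implementation:

```python
# Toy bitmap index: each distinct value maps to a bitmap over row IDs, so
# filters become bitwise AND/OR and counts become population counts.
rows = ["red", "blue", "red", "green", "red", "blue"]

bitmaps: dict[str, int] = {}
for row_id, value in enumerate(rows):
    # Set bit `row_id` in the bitmap for this value.
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)

red, blue = bitmaps["red"], bitmaps["blue"]
print(bin(red))                  # 0b10101 -> rows 0, 2, and 4 are red
print((red | blue).bit_count())  # rows that are red OR blue -> 5
```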
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
The majority of blog posts and presentations about data engineering and analytics assume that the consumers of those efforts are internal business users accessing an environment controlled by the business. In this episode Ian Schweer shares his experiences at Riot Games supporting player-focused features such as machine learning models and recommender systems that are deployed as part of the game binary. He explains the constraints that he and his team are faced with and the various challenges that they have overcome to build useful data products on top of a legacy platform where they don’t control the end-to-end systems.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
The problems that are easiest to fix are the ones that you prevent from happening in the first place. Sifflet is a platform that brings your entire data stack into focus to improve the reliability of your data assets and empower collaboration across your teams. In this episode CEO and founder Salma Bakouk shares her views on the causes and impacts of "data entropy" and how you can tame it before it leads to failures.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
CreditKarma builds data products that help consumers take advantage of their credit and financial capabilities. To make that possible they need a reliable data platform that empowers all of the organization’s stakeholders. In this episode Vishnu Venkataraman shares the journey that he and his team have taken to build and evolve their systems and improve the product offerings that they are able to support.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding data products is an undertaking that has historically required substantial investments of time and talent. With the rise in cloud platforms and self-serve data technologies the barrier to entry is dropping. Shane Gibson co-founded AgileData to make analytics accessible to companies of all sizes. In this episode he explains the design of the platform and how it builds on agile development principles to help you focus on delivering value.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDespite the best efforts of data engineers, data is as messy as the real world. Entity resolution and fuzzy matching are powerful utilities for cleaning up data from disconnected sources, but they have typically required custom development and the training of machine learning models. Sonal Goyal created and open-sourced Zingg as a generalized tool for data mastering and entity resolution to reduce the effort involved in adopting those practices. In this episode she shares the story behind the project, the details of how it is implemented, and how you can use it for your own data projects.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nA lot of the work that goes into data engineering is trying to make sense of the "data exhaust" from other applications and services. There is an undeniable amount of value and utility in that information, but it also introduces significant cost and time requirements. In this episode Nick King discusses how you can be intentional about data creation in your applications and services to reduce the friction and errors involved in building data products and ML applications. He also describes the considerations involved in bringing behavioral data into your systems, and the ways that he and the rest of the Snowplow team are working to make that an easy addition to your platforms.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBusiness intelligence has grown beyond its initial manifestation as dashboards and reports. In its current incarnation it has become a ubiquitous need for analytics and opportunities to answer questions with data. In this episode Amir Orad discusses the Sisense platform and how it facilitates the embedding of analytics and data insights in every aspect of organizational and end-user experiences.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
One of the most impactful technologies for data analytics in recent years has been dbt. It’s hard to have a conversation about data engineering or analysis without mentioning it. Despite its widespread adoption there are still rough edges in its workflow that cause friction for data analysts. To help simplify the adoption and management of dbt projects Nandam Karthik helped create Optimus. In this episode he shares his experiences working with organizations to adopt analytics engineering patterns and the ways that Optimus and dbt were combined to let data analysts deliver insights without the roadblocks of complex pipeline management.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Agile methodologies have been adopted by a majority of teams for building software applications. Applying those same practices to data can prove challenging due to the number of systems that need to be included to implement a complete feature. In this episode Shane Gibson shares practical advice and insights from his years of experience as a consultant and engineer working in data about how to adopt agile principles in your data work so that you can move faster and provide more value to the business, while building systems that are maintainable and adaptable.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
The database market has seen unprecedented activity in recent years, with new options addressing a variety of needs being introduced on a nearly constant basis. Despite that, there are a handful of databases that continue to be adopted due to their proven reliability and robust features. MariaDB is one of those default options that has continued to grow and innovate while offering a familiar and stable experience. In this episode field CTO Manjot Singh shares his experiences as an early user of MySQL and MariaDB and explains how the suite of products being built on top of the open source foundation address the growing needs for advanced storage and analytical capabilities.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nLogistics and supply chains are under increased stress and scrutiny in recent years. In order to stay ahead of customer demands, businesses need to be able to react quickly and intelligently to changes, which requires fast and accurate insights into their operations. Pathway is a streaming database engine that embeds artificial intelligence into the storage, with functionality designed to support the spatiotemporal data that is crucial for shipping and logistics. In this episode Adrian Kosowski explains how the Pathway product got started, how its design simplifies the creation of data products that support supply chain operations, and how developers can help to build an ecosystem of applications that allow businesses to accelerate their time to insight.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe "data lakehouse" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses. Dremio is one of the companies leading the development of products and services that support the open lakehouse. In this episode Jason Hughes explains what it means for a lakehouse to be "open" and describes the different components that the Dremio team build and contribute to.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe core of any data platform is the centralized storage and processing layer. For many that is a data warehouse, but in order to support a diverse and constantly changing set of uses and technologies the data lakehouse is a paradigm that offers a useful balance of scale and cost, with performance and ease of use. In order to make the data lakehouse available to a wider audience the team at Iomete built an all-in-one service that handles management and integration of the various technologies so that you can focus on answering important business questions. In this episode Vusal Dadalov explains how the platform is implemented, the motivation for a truly open architecture, and how they have invested in integrating with the broader ecosystem to make it easy for you to get started.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nFor any business that wants to stay in operation, the most important thing they can do is understand their customers. American Express has invested substantial time and effort in their Customer 360 product to achieve that understanding. In this episode Purvi Shah, the VP of Enterprise Big Data Platforms at American Express, explains how they have invested in the cloud to power this visibility and the complex suite of integrations they have built and maintained across legacy and modern systems to make it possible.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe global economy is dependent on complex and dynamic networks of supply chains powered by sophisticated logistics. This requires a significant amount of data to track shipments and operational characteristics of materials and goods. Roambee is a platform that collects, integrates, and analyzes all of that information to provide companies with the critical insights that they need to stay running, especially in a time of such constant change. In this episode Roambee CEO Sanjay Sharma shares the types of questions that companies are asking about their logistics, the technical work that they do to provide ways to answer those questions, and how they approach the challenge of data quality in its many forms.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData lineage is something that has grown from a convenient feature to a critical need as data systems have grown in scale, complexity, and centrality to business. Alvin is a platform that aims to provide a low effort solution for data lineage capabilities focused on simplifying the work of data engineers. In this episode co-founder Martin Sahlen explains the impact that easy access to lineage information can have on the work of data engineers and analysts, and how he and his team have designed their platform to offer that information to engineers and stakeholders in the places that they interact with data.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData integration from source systems to their downstream destinations is the foundational step for any data product. The increasing expectation that information be instantly accessible drives the need for reliable change data capture, and the team at Fivetran have recently introduced that functionality to power real-time data products. In this episode Mark Van de Wiel explains how they integrated CDC functionality into their existing product and discusses the nuances of different approaches to change data capture from various sources.
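\nAs a minimal sketch of the change data capture concept, the following snippet applies a stream of change events to a destination table; the event format is invented for illustration and says nothing about Fivetran's internal protocol:

```python
# Applying a CDC event stream: inserts and updates are upserts keyed on
# the primary key, deletes remove the row from the destination.
events = [
    {"op": "insert", "id": 1, "data": {"name": "Ada"}},
    {"op": "update", "id": 1, "data": {"name": "Ada Lovelace"}},
    {"op": "delete", "id": 1, "data": None},
]

destination: dict[int, dict] = {}
for event in events:
    if event["op"] == "delete":
        destination.pop(event["id"], None)
    else:
        destination[event["id"]] = event["data"]

print(destination)  # {} -- the row was created, renamed, then removed
```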
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nRegardless of how data is being used, it is critical that the information is trusted. The practice of data reliability engineering has gained momentum recently to address that question. To help support the efforts of data teams the folks at Soda Data created the Soda Checks Language and the corresponding Soda Core utility that acts on this new DSL. In this episode Tom Baeyens explains their reasons for creating a new syntax for expressing and validating checks for data assets and processes, as well as how to incorporate it into your own projects.
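\nTo give a flavor of what a declarative checks DSL buys you, here is a toy evaluator in Python; the check expressions are simplified stand-ins, not SodaCL's actual syntax, and nothing here uses the Soda Core API:

```python
# A toy declarative check runner: each human-readable check maps to a
# predicate that is evaluated against the data set.
checks = {
    "row_count > 0": lambda rows: len(rows) > 0,
    "missing_count(email) = 0": lambda rows: all(r.get("email") for r in rows),
}

rows = [{"email": "a@example.com"}, {"email": None}]
for description, predicate in checks.items():
    status = "PASS" if predicate(rows) else "FAIL"
    print(f"{status}: {description}")
```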
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nIn order to improve efficiency in any business you must first know what is contributing to wasted effort or missed opportunities. When your business operates across multiple locations it becomes even more challenging and important to gain insights into how work is being done. In this episode Tommy Yionoulis shares his experiences working in the service and hospitality industries and how that led him to found OpsAnalitica, a platform for collecting and analyzing metrics on multi-location businesses and their operational practices. He discusses the challenges of making data collection purposeful and efficient without distracting employees from their primary duties and how business owners can use the provided analytics to support their staff in their work.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThere is a constant tension in business data between growing silos and breaking them down. Even when a tool is designed to integrate information as a guard against data isolation, it can easily become a silo of its own, where you have to make a point of using it to seek out information. In order to help distribute critical context about data assets and their status into the locations where work is being done Nicholas Freund co-founded Workstream. In this episode he discusses the challenge of maintaining shared visibility and understanding of data work across the various stakeholders and his efforts to make it a seamless experience.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAny business that wants to understand their operations and customers through data requires some form of pipeline. Building reliable data pipelines is a complex and costly undertaking with many layered requirements. In order to reduce the amount of time and effort required to build pipelines that power critical insights Manish Jethani co-founded Hevo Data. In this episode he shares his journey from building a consumer product to launching a data pipeline service and how his frustrations as a product owner have informed his work at Hevo Data.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Data engineering systems are complex and interconnected with myriad and often opaque chains of dependencies. As they scale, the problems of visibility and dependency management can increase at an exponential rate. In order to turn this into a tractable problem one approach is to define and enforce contracts between producers and consumers of data. Ananth Packkildurai created Schemata as a way to make the creation of schema contracts a lightweight process, allowing the dependency chains to be constructed and evolved iteratively and integrating validation of changes into standard delivery systems. In this episode he shares the design of the project and how it fits into your development practices.
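\nAs a hypothetical sketch of what enforcing a contract between producers and consumers can look like, the snippet below validates records against a declared schema; the contract format and helper are invented for illustration and are not Schemata's actual model:

```python
# A minimal schema contract: field names and their expected types.
contract = {"event_id": int, "email": str}


def contract_violations(record: dict) -> list[str]:
    # Collect every way the record breaks the contract.
    problems = []
    for field, expected_type in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


print(contract_violations({"event_id": "abc"}))
# ['wrong type for event_id', 'missing field: email']
```

Wiring a check like this into the delivery pipeline is what turns the contract from documentation into an enforced constraint.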
\nHello and welcome to the Data Engineering Podcast, the show about modern data management
\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
\nPrefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
\nYour host is Tobias Macey and today I’m interviewing Ananth Packkildurai about Schemata, a modelling framework for decentralised domain-driven ownership of data.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData observability is a product category that has seen massive growth and adoption in recent years. Monte Carlo is in the vanguard of companies who have been enabling data teams to observe and understand their complex data systems. In this episode founders Barr Moses and Lior Gavish rejoin the show to reflect on the evolution and adoption of data observability technologies and the capabilities that are being introduced as the broader ecosystem adopts the practices.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe global climate impacts everyone, and the rate of change introduces many questions that businesses need to consider. Getting answers to those questions is challenging, because the climate is a multidimensional and constantly evolving system. Sust Global was created to provide curated data sets for organizations to be able to analyze climate information in the context of their business needs. In this episode Gopal Erinjippurath discusses the data engineering challenges of building and serving those data sets, and how they are distilling complex climate information into consumable facts so you don’t have to be an expert to understand it.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe dream of every engineer is to automate all of their tasks. For data engineers, this is a monumental undertaking. Orchestration engines are one step in that direction, but they are not a complete solution. In this episode Sean Knapp shares his views on what constitutes proper automation and the work that he and his team at Ascend are doing to help make it a reality.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
AirBnB pioneered a number of the organizational practices that have become the goal of modern data teams. Out of that culture a number of successful businesses were created to provide the tools and methods to a broader audience. In this episode several almuni of AirBnB’s formative years who have gone on to found their own companies join the show to reflect on their shared successes, missed opportunities, and lessons learned.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe position of Chief Data Officer (CDO) is relatively new in the business world and has not been universally adopted. As a result, not everyone understands what the responsibilities of the role are, when you need one, and how to hire for it. In this episode Tracy Daniels, CDO of Truist, shares her journey into the position, her responsibilities, and her relationship to the data professionals in her organization.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData has permeated every aspect of our lives and the products that we interact with. As a result, end users and customers have come to expect interactions and updates with services and analytics to be fast and up to date. In this episode Shruti Bhat gives her view on the state of the ecosystem for real-time data and the work that she and her team at Rockset are doing to make it easier for engineers to build those experiences.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData engineers have typically left the process of data labeling to data scientists or other roles because of its nature as a manual and process heavy undertaking, focusing instead on building automation and repeatable systems. Watchful is a platform to make labeling a repeatable and scalable process that relies on codifying domain expertise. In this episode founder Shayan Mohanty explains how he and his team are bringing software best practices and automation to the world of machine learning data preparation and how it allows data engineers to be involved in the process.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData is useless if it isn’t being used, and you can’t use it if you don’t know where it is. Data catalogs were the first solution to this problem, but they are only helpful if you know what you are looking for. In this episode Shinji Kim discusses the challenges of data discovery and how to collect and preserve additional context about each piece of information so that you can find what you need when you don’t even know what you’re looking for yet.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData mesh is a frequent topic of conversation in the data community, with many debates about how and when to employ this architectural pattern. The team at AgileLab have first-hand experience helping large enterprise organizations evaluate and implement their own data mesh strategies. In this episode Paolo Platter shares the lessons they have learned in that process, the Data Mesh Boost platform that they have built to reduce some of the boilerplate required to make it successful, and some of the considerations to make when deciding if a data mesh is the right choice for you.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe optimal format for storage and retrieval of data is dependent on how it is going to be used. For analytical systems there are decades of investment in data warehouses and various modeling techniques. For machine learning applications relational models require additional processing to be directly useful, which is why there has been a growth in the use of vector databases. These platforms store direct representations of the vector embeddings that machine learning models rely on for computing relevant predictions so that there is no additional processing required to go from input data to inference output. In this episode Frank Liu explains how the open source Milvus vector database is implemented to speed up machine learning development cycles, how to think about proper storage and scaling of these vectors, and how data engineering and machine learning teams can collaborate on the creation and maintenance of these data sets.
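\nA rough sketch of that workflow, assuming a Milvus server on localhost and the pymilvus 2.x client API; the collection name and dimensions are hypothetical:

```python
# Store embedding vectors in Milvus and run a nearest-neighbor search.
import numpy as np
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=128),
])
items = Collection("items", schema)

# Insert model-produced vectors, then build an index for fast search.
items.insert([np.random.rand(1000, 128).tolist()])
items.create_index("embedding", {
    "index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 64},
})
items.load()

hits = items.search(
    data=[np.random.rand(128).tolist()], anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 8}}, limit=5,
)
print(hits[0].ids)  # primary keys of the five nearest vectors
```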
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nExploratory data analysis works best when the feedback loop is fast and iterative. This is easy to achieve when you are working on small datasets, but as they scale up beyond what can fit on a single machine those short iterations quickly become long and tedious. The Arkouda project is a Python interface built on top of the Chapel compiler to bring back those interactive speeds for exploratory analysis on horizontally scalable compute that parallelizes operations on large volumes of data. In this episode David Bader explains how the framework operates, the algorithms that are built into it to support complex analyses, and how you can start using it today.
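\nA brief sketch of what that interactive workflow looks like, assuming an arkouda_server process is already running locally; the data here is random placeholder:

```python
# Arkouda keeps the arrays on the Chapel server; only small results
# travel back to the Python client.
import arkouda as ak

ak.connect("localhost", 5555)

a = ak.randint(0, 100, 10**6)
b = ak.randint(0, 100, 10**6)

print((a + b).sum())            # element-wise math runs server-side
print(ak.value_counts(a % 10))  # NumPy-like grouped counts at scale

ak.disconnect()
```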
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData lineage is the roadmap for your data platform, providing visibility into all of the dependencies for any report, machine learning model, or data warehouse table that you are working with. Because of its centrality to your data systems it is valuable for debugging, governance, understanding context, and myriad other purposes. This means that it is important to have an accurate and complete lineage graph so that you don’t have to perform your own detective work when time is in short supply. In this episode Ernie Ostic shares the approach that he and his team at Manta are taking to build a complete view of data lineage across the various data systems in your organization and the useful applications of that information in the work of every data stakeholder.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData engineering is a difficult job, requiring a large number of skills that often don’t overlap. Any effort to understand how to start a career in the role has required stitching together information from a multitude of resources that might not all agree with each other. In order to provide a single reference for anyone tasked with data engineering responsibilities Joe Reis and Matt Housley took it upon themselves to write the book "Fundamentals of Data Engineering". In this episode they share their experiences researching and distilling the lessons that will be useful to data engineers now and into the future, without being tied to any specific technologies that may fade from fashion.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe current stage of evolution in the data management ecosystem has resulted in domain and use case specific orchestration capabilities being incorporated into various tools. This complicates the work involved in making end-to-end workflows visible and integrated. Dagster has invested in bringing insights about external tools’ dependency graphs into one place through its "software defined assets" functionality. In this episode Nick Schrock discusses the importance of orchestration and a central location for managing data systems, the road to Dagster’s 1.0 release, and the new features coming with Dagster Cloud’s general availability.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThere are extensive and valuable data sets that are available outside the bounds of your organization. Whether that data is public, paid, or scraped it requires investment and upkeep to acquire and integrate it with your systems. Crux was built to reduce the total cost of acquisition and ownership for integrating external data, offering a fully managed service for delivering those data assets in the manner that best suits your infrastructure. In this episode Crux CTO Mark Etherington discusses the different costs involved in managing external data, how to think about the total return on investment for your data, and how the Crux platform is architected to reduce the toil involved in managing third party data.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Data engineering is a large and growing subject, with new technologies, specializations, and "best practices" emerging at an accelerating pace. This podcast does its best to explore this fractal ecosystem, and has been at it for the past 5+ years. In this episode Joe Reis, founder of Ternary Data and co-author of "Fundamentals of Data Engineering", turns the tables and interviews the host, Tobias Macey, about his journey into podcasting, how he runs the show behind the scenes, and the other things that occupy his time.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding a data platform is a journey, not a destination. Beyond the work of assembling a set of technologies and building integrations across them, there is also the work of growing and organizing a team that can support and benefit from that platform. In this episode Inbar Yogev and Lior Winner share the journey that they and their teams at Riskified have been on for their data platform. They also discuss how they have established a guild system for training and supporting data professionals in the organization.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding and maintaining reliable data assets is the prime directive for data engineers. While it is easy to say, it is endlessly complex to implement, requiring data professionals to be experts in a wide range of disparate topics while designing and implementing complex topologies of information workflows. In order to make this a tractable problem it is essential that engineers embrace automation at every opportunity. In this episode Chris Riccomini shares his experiences building and scaling data operations at WePay and LinkedIn, as well as the lessons he has learned working with other teams as they automated their own systems.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe perennial challenge of data engineers is ensuring that information is integrated reliably. While it is straightforward to know whether a synchronization process succeeded, it is not always clear whether every record was copied correctly. In order to quickly identify if and how two data systems are out of sync Gleb Mezhanskiy and Simon Eskildsen partnered to create the open source data-diff utility. In this episode they explain how the utility is implemented to run quickly and how you can start using it in your own data workflows to ensure that your data warehouse isn’t missing any records from your source systems.
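\nA toy version of the kind of hash-and-bisect strategy that makes such comparisons fast, illustrative only and not the project's actual code: compare checksums of key ranges, and only recurse into ranges whose checksums disagree.

```python
# Bisect on checksums: identical ranges are skipped wholesale, so only
# the neighborhoods of actual differences are compared row by row.
import hashlib


def checksum(table: dict[int, str], lo: int, hi: int) -> str:
    h = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        h.update(f"{key}:{table[key]}".encode())
    return h.hexdigest()


def diff(src, dst, lo, hi, out):
    if checksum(src, lo, hi) == checksum(dst, lo, hi):
        return  # identical range: no rows fetched
    if hi - lo <= 2:
        out.extend(k for k in range(lo, hi) if src.get(k) != dst.get(k))
        return
    mid = (lo + hi) // 2
    diff(src, dst, lo, mid, out)
    diff(src, dst, mid, hi, out)


src = {i: "v" for i in range(100)}
dst = dict(src, **{42: "changed"})
mismatches: list[int] = []
diff(src, dst, 0, 100, mismatches)
print(mismatches)  # [42]
```

In a real deployment the checksums would be computed inside each database engine so that only digests, not rows, cross the network.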
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nSpecial Guest: Gleb Mezhanskiy.
","summary":"An interview with Gleb Mezhanskiy and Simon Eskildsen about how the open source data-diff utility can quickly and reliably validate data between your source and destination systems so that you can be confident that everything is working as intended.","date_published":"2022-07-03T16:45:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d0042300-5a77-4f94-aad4-dfe5ba6d24a3.mp3","mime_type":"audio/mpeg","size_in_bytes":65484471,"duration_in_seconds":4257}]},{"id":"podlove-2022-07-03t20:47:22+00:00-f4a555a7a7583d8","title":"The View From The Lakehouse Of Architectural Patterns For Your Data Platform","url":"https://www.dataengineeringpodcast.com/starburst-lakehouse-modern-data-architecture-episode-304","content_text":"Summary\nThe ecosystem for data tools has been going through rapid and constant evolution over the past several years. These technological shifts have brought about corresponding changes in data and platform architectures for managing data and analytical workflows. In this episode Colleen Tartow shares her insights into the motivating factors and benefits of the most prominent patterns that are in the popular narrative; data mesh and the modern data stack. She also discusses her views on the role of the data lakehouse as a building block for these architectures and the ongoing influence that it will have as the technology matures.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!\nAtlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.\nModern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. 
No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.\nTired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!\nYour host is Tobias Macey and today I’m interviewing Colleen Tartow about her views on the forces shaping the current generation of data architectures\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\n\nIn your opinion as an astrophysicist, how well does the metaphor of a starburst map to your current work at the company of the same name?\n\n\nCan you describe what you see as the dominant factors that influence a team’s approach to data architecture and design?\nTwo of the most repeated (often mis-attributed) terms in the data ecosystem for the past couple of years are the \"modern data stack\" and the \"data mesh\". As someone who is working at a company that can be construed to provide solutions for either/both of those patterns, what are your thoughts on their lasting strength and long-term viability?\nWhat do you see as the strengths of the emerging lakehouse architecture in the context of the \"modern data stack\"?\n\nWhat are the factors that have prevented it from being a default choice compared to cloud data warehouses? (e.g. BigQuery, Redshift, Snowflake, Firebolt, etc.)\nWhat are the recent developments that are contributing to its current growth?\nWhat are the weak points/sharp edges that still need to be addressed? (both internal to the platforms and in the external ecosystem/integrations)\n\n\nWhat are some of the implementation challenges that teams often experience when trying to adopt a lakehouse strategy as the core building block of their data systems?\n\nWhat are some of the exercises that they should be performing to help determine their technical and organizational capacity to support that strategy over the long term?\n\n\nOne of the core requirements for a data mesh implementation is to have a common system that allows for product teams to easily build their solutions on top of. 
How do lakehouse/data virtualization systems allow for that?\n\nWhat are some of the lessons that need to be shared with engineers to help them make effective use of these technologies when building their own data products?\nWhat are some of the supporting services that are helpful in these undertakings?\n\n\nWhat do you see as the forces that will have the most influence on the trajectory of data architectures over the next 2 – 5 years?\nWhat are the most interesting, innovative, or unexpected ways that you have seen lakehouse architectures used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on the Starburst product?\nWhen is a lakehouse the wrong choice?\nWhat do you have planned for the future of Starburst’s technology platform?\n\nContact Info\n\nLinkedIn\n@ctartow on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers\n\nLinks\n\nStarburst\nTrino\nTeradata\nCognos\nData Lakehouse\nData Virtualization\nIceberg\n\nPodcast Episode\n\n\nHudi\n\nPodcast Episode\n\n\nDelta\n\nPodcast Episode\n\n\nSnowflake\n\nPodcast Episode\n\n\nAWS Lake Formation\nClickhouse\n\nPodcast Episode\n\n\nDruid\nPinot\n\nPodcast Episode\n\n\nStarburst Galaxy\nVarada\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"The ecosystem for data tools has been going through rapid and constant evolution over the past several years. These technological shifts have brought about corresponding changes in data and platform architectures for managing data and analytical workflows. In this episode Colleen Tartow shares her insights into the motivating factors and benefits of the most prominent patterns that are in the popular narrative; data mesh and the modern data stack. She also discusses her views on the role of the data lakehouse as a building block for these architectures and the ongoing influence that it will have as the technology matures.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe proliferation of sensors and GPS devices has dramatically increased the number of applications for spatial data, and the need for scalable geospatial analytics. In order to reduce the friction involved in aggregating disparate data sets that share geographic similarities the Unfolded team built a platform that supports working across raster, vector, and tabular data in a single system. In this episode Isaac Brodsky explains how the Unfolded platform is architected, their experience joining the team at Foursquare, and how you can start using it for analyzing your spatial data today.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
The most complicated part of data engineering is the effort involved in making the raw data fit into the narrative of the business. Master Data Management (MDM) is the process of building consensus around what the information actually means in the context of the business and then shaping the data to match those semantics. In this episode Malcolm Hawker shares his years of experience working in this domain to explore the combination of technical and social skills that are necessary to make an MDM project successful both at the outset and over the long term.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData analysis is a valuable exercise that is often out of reach of non-technical users as a result of the complexity of data systems. In order to lower the barrier to entry Ryan Buick created the Canvas application with a spreadsheet oriented workflow that is understandable to a wide audience. In this episode Ryan explains how he and his team have designed their platform to bring everyone onto a level playing field and the benefits that it provides to the organization.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nMetadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility due to their nature as storage systems. In order to level up their value a new trend of active metadata is being implemented, allowing use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data governance. In this episode Prukalpa Sankar joins the show to talk about the work she and her team at Atlan are doing to push this capability into the mainstream.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nUnstructured data takes many forms in an organization. From a data engineering perspective that often means things like JSON files, audio or video recordings, images, etc. Another category of unstructured data that every business deals with is PDFs, Word documents, workstation backups, and countless other types of information. Aparavi was created to tame the sprawl of information across machines, datacenters, and clouds so that you can reduce the amount of duplicate data and save time and money on managing your data assets. In this episode Rod Christensen shares the story behind Aparavi and how you can use it to cut costs and gain value for the long tail of your unstructured data.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding a well rounded and effective data team is an iterative process, and the first hire can set the stage for future success or failure. Trupti Natu has been the first data hire multiple times and gone through the process of building teams across the different stages of growth. In this episode she shares her thoughts and insights on how to be intentional about establishing your own data team.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe best way to make sure that you don’t leak sensitive data is to never have it in the first place. The team at Skyflow decided that the second best way is to build a storage system dedicated to securely managing your sensitive information and making it easy to integrate with your applications and data systems. In this episode Sean Falconer explains the idea of a data privacy vault and how this new architectural element can drastically reduce the potential for making a mistake with how you manage regulated or personally identifiable information.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nCloud services have made highly scalable and performant data platforms economical and manageable for data teams. However, they are still challenging to work with and manage for anyone who isn’t in a technical role. Hung Dang understood the need to make data more accessible to the entire organization and created Y42 as a better user experience on top of the "modern data stack". In this episode he shares how he designed the platform to support the full spectrum of technical expertise in an organization and the interesting engineering challenges involved.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
The latest generation of data warehouse platforms have brought unprecedented operational simplicity and effectively infinite scale. Along with those benefits, they have also introduced a new consumption model that can lead to incredibly expensive bills at the end of the month. In order to ensure that you can explore and analyze your data without spending money on inefficient queries Mingsheng Hong and Zheng Shao created Bluesky Data. In this episode they explain how their platform optimizes your Snowflake warehouses to reduce cost, as well as identifying improvements that you can make in your queries to reduce their contribution to your bill.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nA large fraction of data engineering work involves moving data from one storage location to another in order to support different access and query patterns. Singlestore aims to cut down on the number of database engines that you need to run so that you can reduce the amount of copying that is required. By supporting fast, in-memory row-based queries and columnar on-disk representation, it lets your transactional and analytical workloads run in the same database. In this episode SVP of engineering Shireesh Thota describes the impact on your overall system architecture that Singlestore can have and the benefits of using a cloud-native database engine for your next application.
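\nA schematic illustration of the row-versus-column trade-off in plain Python, generic data layout only and nothing to do with Singlestore's internals:

```python
# Row store: one record per entry; cheap to fetch a whole record.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 20.5},
]

# Column store: one array per field; cheap to scan a single column.
columns = {
    "id": [1, 2],
    "amount": [10.0, 20.5],
}

record = next(r for r in rows if r["id"] == 2)  # transactional lookup
total = sum(columns["amount"])                  # analytical aggregate
print(record, total)
```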
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe interfaces and design cues that a tool offers can have a massive impact on who is able to use it and the tasks that they are able to perform. With an eye to making data workflows more accessible to everyone in an organization Raj Bains and his team at Prophecy designed a powerful and extensible low-code platform that lets technical and non-technical users scale data flows without forcing everyone into the same layers of abstraction. In this episode he explores the tension between code-first and no-code utilities and how he is working to balance the strengths without falling prey to their shortcomings.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nMachine learning has become a meaningful target for data applications, bringing with it an increase in the complexity of orchestrating the entire data flow. Flyte is a project that was started at Lyft to address their internal needs for machine learning and integrated closely with Kubernetes as the execution manager. In this episode Ketan Umare and Haytham Abuelfutuh share the story of the Flyte project and how their work at Union is focused on supporting and scaling the code and community that has made Flyte successful.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Industrial applications are one of the primary adopters of Internet of Things (IoT) technologies, with business critical operations being informed by data collected across a fleet of sensors. Vopak is a business that manages storage and distribution of a variety of liquids that are critical to the modern world, and they have recently launched a new platform to gain more utility from their industrial sensors. In this episode Mário Pereira shares the system design that he and his team have developed for collecting and managing the collection and analysis of sensor data, and how they have split the data processing and business logic responsibilities between physical terminals and edge locations, and centralized storage and compute.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDesigning a data platform is a complex and iterative undertaking which requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds additional layers of difficulty. Srivatsan Sridharan has had the opportunity to design, build, and run data lake platforms for both Yelp and Robinhood, with many valuable lessons learned from each experience. In this episode he shares his insights and advice on how to approach such an undertaking in your own organization.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDan Delorey helped to build the core technologies of Google’s cloud data services for many years before embarking on his latest adventure as the VP of Data at SoFi. From being an early engineer on the Dremel project, to helping launch and manage BigQuery, on to helping enterprises adopt Google’s data products he learned all of the critical details of how to run services used by data platform teams. Now he is the consumer of many of the tools that his work inspired. In this episode he takes a trip down memory lane to weave an interesting and informative narrative about the broader themes throughout his work and their echoes in the modern data ecosystem.
\nIntroduction
\nHow did you get involved in the area of data management?
\nCan you start by sharing what your current relationship to the data ecosystem is and the cliffs-notes version of how you ended up there?
\nDremel was a ground-breaking technology at the time. What do you see as its lasting impression on the landscape of data both in and outside of Google?
\nYou were instrumental in crafting the vision behind “querying data in place” (what they called federated data) at Dremel and BigQuery. What do you mean by this? How has this approach evolved? What are some challenges with this approach?
\nFollowing your work on Dremel you were involved with the development and growth of BigQuery and the broader suite of Google Cloud’s data platform. What do you see as the influence that those tools had on the evolution of the broader data ecosystem?
\nHow have your experiences at Google influenced your approach to platform and organizational design at SoFi?
\nWhat’s in SoFi’s data stack? How do you decide what technologies to buy vs. build in-house?
\nHow does your team solve for data quality and governance?
\nWhen you’re not building industry-defining data tooling or leading data strategy, you spend time thinking about the ethics of data. Can you elaborate a bit about your research and interest there?
\nYou also have some ideas about data marketplaces, which is a hot topic these days with companies like Snowflake and Databricks breaking into this economy. What’s your take on the evolution of this space?
\nWhat are the most interesting, innovative, or unexpected data systems that you have encountered?
\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on building and supporting data systems?
\nWhat are the areas that you are paying the most attention to?
\nWhat interesting predictions do you have for the future of data systems and their applications?
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nMany of the events, ideas, and objects that we try to represent through data have a high degree of connectivity in the real world. These connections are best represented and analyzed as graphs to provide efficient and accurate analysis of their relationships. TigerGraph is a leading database that offers a highly scalable and performant native graph engine for powering graph analytics and machine learning. In this episode Jon Herke shares how TigerGraph customers are taking advantage of those capabilities to achieve meaningful discoveries in their fields, the utilities that it provides for modeling and managing your connected data, and some of his own experiences working with the platform before joining the company.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe predominant pattern for data integration in the cloud has become extract, load, and then transform or ELT. Matillion was an early innovator of that approach and in this episode CTO Ed Thompson explains how they have evolved the platform to keep pace with the rapidly changing ecosystem. He describes how the platform is architected, the challenges related to selling cloud technologies into enterprise organizations, and how you can adopt Matillion for your own workflows to reduce the maintenance burden of data integration workflows.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding a data platform is an iterative and evolutionary process that requires collaboration with internal stakeholders to ensure that their needs are being met. Yotpo has been on a journey to evolve and scale their data platform to continue serving the needs of their organization as it increases the scale and sophistication of data usage. In this episode Doron Porat and Liran Yogev explain how they arrived at their current architecture, the capabilities that they are optimizing for, and the complex process of identifying and evaluating new components to integrate into their systems. This is an excellent exploration of the decisions and tradeoffs that need to be made while building such a complex system.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nA huge amount of effort goes into modeling and shaping data to make it available for analytical purposes. This is often due to the need to simplify the final queries so that they are performant for visualization or limited exploration. In order to cut down the level of effort involved in making data usable, Matthew Halliday and his co-founders created Incorta as an end-to-end, in-memory analytical engine that removes barriers to insights on your data. In this episode he explains how the system works, the use cases that it empowers, and how you can start using it for your own analytics today.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThere are very few tools which are equally useful for data engineers, data scientists, and machine learning engineers. WhyLogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data from source to productionized model. In this episode Andy Dang explains why the project was created, how you can apply it to your existing data systems, and how it functions to provide detailed context for being able to gain insight into all of your data processes.
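\nTo make that concrete, here is a minimal sketch of profiling a dataframe, assuming the whylogs v1 Python API and a hypothetical dataframe:

```python
import pandas as pd
import whylogs as why

df = pd.DataFrame({"user_id": [1, 2, 3], "amount": [9.99, 15.00, None]})

results = why.log(df)              # profile the dataframe in one pass
profile_view = results.view()      # lightweight statistical summary
print(profile_view.to_pandas())    # per-column counts, null ratios, distributions
```

\nThe profile is a compact statistical sketch rather than a copy of the data, which is what makes it practical to log from every stage of a pipeline.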
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe next paradigm shift in computing is coming in the form of quantum technologies. Quantum processors have gained significant attention for their speed and computational power. The next frontier is in quantum networking for highly secure communications and the ability to distribute workloads across quantum processing units without costly translation between quantum and classical systems. In this episode Prineha Narang, co-founder and CTO of Aliro, explains how these systems work, the capabilities that they can offer, and how you can start preparing for a post-quantum future for your data systems.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nPutting machine learning models into production and keeping them there requires investing in well-managed systems that cover the full lifecycle of data cleaning, training, deployment, and monitoring. This requires a repeatable and evolvable set of processes to keep everything functional. The term MLOps has been coined to encapsulate all of these principles, and the broader data community is working to establish a set of best practices and useful guidelines for streamlining adoption. In this episode Demetrios Brinkmann and David Aponte share their perspectives on this rapidly changing space and what they have learned from their work building the MLOps community through blog posts, podcasts, and discussion forums.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData engineering is a practice that is multi-faceted and requires integration with a large number of systems. This often means working across multiple tools to get the job done, which can introduce a significant cost to productivity due to the number of context switches. Rivery is a platform designed to reduce this incidental complexity and provide a single system for working across the different stages of the data lifecycle. In this episode CEO and founder Itamar Ben Hemo explains how his experiences in the industry led to his vision for the Rivery platform as a single place to build end-to-end analytical workflows, including how it is architected and how you can start using it today for your own work.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAny time that you are storing data about people there are a number of privacy and security considerations that come with it. Privacy engineering is a growing field in data management that focuses on how to protect attributes of personal data so that the containing datasets can be shared safely. In this episode Gretel co-founder and CTO John Myers explains how they are building tools for data engineers and analysts to incorporate privacy engineering techniques into their workflows and validate the safety of their data against re-identification attacks.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe flexibility of software-oriented data workflows is useful for fulfilling complex requirements, but for simple and repetitious use cases it adds significant complexity. Coalesce is a platform designed to reduce repetitive work for common workflows by adopting a visual pipeline builder to support your data warehouse transformations. In this episode Satish Jayanthi explains how he is building a framework to allow enterprises to move quickly while maintaining guardrails for data workflows. This allows everyone in the business to participate in data analysis in a sustainable manner.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding a data platform for your organization is a challenging undertaking. Building multiple data platforms for other organizations as a service without burning out is another thing entirely. In this episode Brandon Beidel from Red Ventures shares his experiences as a data product manager in charge of helping his customers build scalable analytics systems that fit their needs. He explains the common patterns that have been useful across multiple use cases, as well as when and how to build customized solutions.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAt the foundational layer many databases and data processing engines rely on key/value storage for managing the layout of information on the disk. RocksDB is one of the most popular choices for this component and has been incorporated into popular systems such as ksqlDB. As these systems are scaled to larger volumes of data and higher throughputs the RocksDB engine can become a bottleneck for performance. In this episode Adi Gelvan shares the work that he and his team at SpeeDB have put into building a drop-in replacement for RocksDB that eliminates that bottleneck. He explains how they redesigned the core algorithms and storage management features to deliver ten times faster throughput, how the lower latencies work to reduce the burden on platform engineers, and how they are working toward an open source offering so that you can try it yourself with no friction.
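\nFor context on what that embedded key/value layer looks like to the systems built on top of it, here is a minimal sketch using the python-rocksdb bindings (the database path and keys are hypothetical); a drop-in replacement has to preserve exactly this kind of API surface:

```python
import rocksdb

# open (or create) an embedded key/value store on local disk
db = rocksdb.DB("state.db", rocksdb.Options(create_if_missing=True))

db.put(b"user:42", b'{"last_seen": "2022-01-15"}')  # write a key/value pair
value = db.get(b"user:42")                          # point lookup

it = db.iterkeys()
it.seek(b"user:")                                   # ordered scan from a prefix
```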
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData governance is a practice that requires a high degree of flexibility and collaboration at the organizational and technical levels. The growing prominence of cloud and hybrid environments in data management adds additional stress to an already complex endeavor. Privacera is an enterprise grade solution for cloud and hybrid data governance built on top of the robust and battle tested Apache Ranger project. In this episode Balaji Ganesan shares how his experiences building and maintaining Ranger in previous roles helped him understand the needs of organizations and engineers as they define and evolve their data governance policies and practices.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData assets and the pipelines that create them have become critical production infrastructure for companies. This adds a requirement for reliability and management of up-time similar to application infrastructure. In this episode Francisco Alberini and Mei Tao share their insights on what incident management looks like for data platforms and the teams that support them.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData and analytics are permeating every system, including customer-facing applications. The introduction of embedded analytics to an end-user product creates a significant shift in requirements for your data layer. The Pinot OLAP datastore was created for this purpose, optimizing for low latency on rapidly updating datasets with highly concurrent queries. In this episode Kishore Gopalakrishna and Xiang Fu explain how it is able to achieve those characteristics, their work at StarTree to make it more easily available, and how you can start using it for your own high throughput data workloads today.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData observability is a term that has been co-opted by numerous vendors with varying ideas of what it should mean. At Acceldata, they view it as a holistic approach to understanding the computational and logical elements that power your analytical capabilities. In this episode Tristan Spaulding, head of product at Acceldata, explains the multi-dimensional nature of gaining visibility into your running data platform and how they have architected their platform to assist in that endeavor.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe modern data stack is a constantly moving target which makes it difficult to adopt without prior experience. In order to accelerate the time to deliver useful insights at organizations of all sizes that are looking to take advantage of these new and evolving architectures, Tarush Aggarwal founded 5X Data. In this episode he explains how he works with these companies to deploy the technology stack and pairs them with an experienced engineer who assists with the implementation and training to let them realize the benefits of this architecture. He also shares his thoughts on the current state of the ecosystem for modern data vendors and trends to watch as we move into the future.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nWhen you think about selecting a database engine for your project you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single user workloads. DuckDB is an in-process database engine optimized for OLAP workloads that speeds up your analytical queries and meets you where you are, whether that’s Python, R, Java, or even the web. In this episode, Hannes Mühleisen, co-creator and CEO of DuckDB Labs, shares the motivations for creating the project, the myriad ways that it can be used to speed up your data projects, and the detailed engineering efforts that go into making it adaptable to any environment. This is a fascinating and humorous exploration of a truly useful piece of technology.
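\nAs a minimal sketch of the in-process model, this queries a local Parquet file directly from Python with no server involved (the file name is hypothetical):

```python
import duckdb

con = duckdb.connect()  # in-memory database running inside the Python process
result = con.execute("""
    SELECT category, avg(price) AS avg_price
    FROM 'events.parquet'           -- DuckDB can scan Parquet files in place
    GROUP BY category
    ORDER BY avg_price DESC
""").fetchdf()                      # results come back as a pandas DataFrame
print(result)
```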
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDatabases are an important component of application architectures, but they are often difficult to work with. HarperDB was created with the core goal of being a developer friendly database engine. In the process they ended up creating a scalable distributed engine that works across edge and datacenter environments to support a variety of novel use cases. In this episode co-founder and CEO Stephen Goldberg shares the history of the project, how it is architected to achieve their goals, and how you can start using it today.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThere are a wealth of options for managing structured and textual data, but unstructured binary data assets are not as well supported across the ecosystem. As organizations start to adopt cloud technologies they need a way to manage the distribution, discovery, and collaboration of data across their operating environments. To help solve this complicated challenge Krishna Subramanian and her co-founders at Komprise built a system that allows you to use and secure your data wherever it lives, and track copies across environments without requiring manual intervention. In this episode she explains the difficulties that everyone faces as they scale beyond a single operating environment, and how the Komprise platform reduces the burden of managing large and heterogeneous collections of unstructured files.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding a data platform is a complex journey that requires a significant amount of planning to do well. It requires knowledge of the available technologies, the requirements of the operating environment, and the expectations of the stakeholders. In this episode Tobias Macey, the host of the show, reflects on his plans for building a data platform and what he has learned from running the podcast that is influencing his choices.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nPython has grown to be one of the top languages used for all aspects of data, from collection and cleaning, to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge the Fugue project offers an interface to automatically translate across Pandas, Spark, and Dask execution environments without having to modify your logic. In this episode core contributor Kevin Kho explains how the slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.
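\nA minimal sketch of that idea, assuming Fugue's transform() entry point and a toy function (the column names are hypothetical): the same pandas-native logic can be handed to a distributed engine without modification.

```python
import pandas as pd
from fugue import transform

def add_total(df: pd.DataFrame) -> pd.DataFrame:
    df["total"] = df["price"] * df["qty"]
    return df

df = pd.DataFrame({"price": [10.0, 20.0], "qty": [1, 3]})

# runs locally on pandas
local = transform(df, add_total, schema="*,total:double")

# the same call can target Spark or Dask by passing an engine, e.g.:
# spark_result = transform(df, add_total, schema="*,total:double", engine=spark)
```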
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe life sciences as an industry has seen incredible growth in scale and sophistication, along with the advances in data technology that make it possible to analyze massive amounts of genomic information. In this episode Guy Yachdav, director of software engineering for ImmunAI, shares the complexities that are inherent to managing data workflows for bioinformatics. He also explains how he has architected the systems that ingest, process, and distribute the data that he is responsible for and the requirements that are introduced when collaborating with researchers, domain experts, and machine learning developers.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nStreaming data sources are becoming more widely available as tools to handle their storage and distribution mature. However it is still a challenge to analyze this data as it arrives, while supporting integration with static data in a unified syntax. Deephaven is a project that was designed from the ground up to offer an intuitive way for you to bring your code to your data, whether it is streaming or static, without having to know which is which. In this episode Pete Goddard, founder and CEO of Deephaven, shares his journey with the technology that powers the platform and how he and his team are pouring their energy into the community edition of the technology so that you can use it freely in your own work.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nCollecting, integrating, and activating data are all challenging activities. When that data pertains to your customers it can become even more complex. To simplify the work of managing the full flow of your customer data and keep you in full control, the team at Rudderstack created their eponymous open source platform that allows you to work with first and third party data, as well as build and manage reverse ETL workflows. In this episode CEO and founder Soumyadeb Mitra explains how Rudderstack compares to the various other tools and platforms that share some overlap, how to set it up for your own data needs, and how it is architected to scale to meet demand.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAlong with globalization of our societies comes the need to analyze the geospatial and geotemporal data that is needed to manage the growth in commerce, communications, and other activities. In order to make geospatial analytics more maintainable and scalable there has been an increase in the number of database engines that provide extensions to their SQL syntax that supports manipulation of spatial data. In this episode Matthew Forrest shares his experiences of working in the domain of geospatial analytics and the application of SQL dialects to his analysis.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThere are many dimensions to the work of protecting the privacy of users in our data. When you need to share a data set with other teams, departments, or businesses then it is of utmost importance that you eliminate or obfuscate personal information. In this episode Will Thompson explores the many ways that sensitive data can be leaked, re-identified, or otherwise be at risk, as well as the different strategies that can be employed to mitigate those attack vectors. He also explains how he and his team at Privacy Dynamics are working to make those strategies more accessible to organizations so that you can focus on all of the other tasks required of you.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe Data Engineering Podcast has been going for five years now and has included conversations and interviews with a huge number of guests, covering a broad range of topics. In addition to that, the host curated the essays contained in the book "97 Things Every Data Engineer Should Know", using the knowledge and context gained from running the show to inform the selection process. In this episode he shares some reflections on producing the podcast, compiling the book, and relevant trends in the ecosystem of data engineering. He also provides some advice for those who are early in their career of data engineering and looking to advance in their roles.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nPandas is a powerful tool for cleaning, transforming, manipulating, or enriching data, among many other potential uses. As a result it has become a standard tool for data engineers for a wide range of applications. Matt Harrison is a Python expert with a long history of working with data who now spends his time on consulting and training. He recently wrote a book on effective patterns for Pandas code, and in this episode he shares advice on how to write efficient data processing routines that will scale with your data volumes, while being understandable and maintainable.
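\nIn that spirit (this is an illustrative sketch, not an excerpt from the book; the file and column names are hypothetical), method chaining and explicit dtypes are two of the patterns that keep Pandas code readable and memory-efficient:

```python
import pandas as pd

totals = (
    pd.read_csv("orders.csv")                          # hypothetical input
      .astype({"status": "category", "qty": "int32"})  # smaller dtypes cut memory
      .assign(revenue=lambda d: d.qty * d.unit_price)  # derived column, no temp vars
      .query("status == 'complete'")
      .groupby("region", observed=True)["revenue"]
      .sum()
      .reset_index()
)
```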
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData platforms are exemplified by a complex set of connections that are subject to a set of constantly evolving requirements. In order to make this a tractable problem it is necessary to define boundaries for communication between concerns, which brings with it the need to establish interface contracts for communicating across those boundaries. The recent move toward the data mesh as a formalized architecture that builds on this design provides the language that data teams need to make this a more organized effort. In this episode Abhi Sivasailam shares his experience designing and implementing a data mesh solution with his team at Flexport, and the importance of defining and enforcing data contracts that are implemented at those domain boundaries.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData engineering is a relatively young and rapidly expanding field, with practitioners having a wide array of experiences as they navigate their careers. Ashish Mrig currently leads the data analytics platform for Wayfair, as well as running a local data engineering meetup. In this episode he shares his career journey, the challenges related to management of data professionals, and the platform design that he and his team have built to power analytics at a large company. He also provides some excellent insights into the factors that play into the build vs. buy decision at different organizational sizes.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData quality control is a requirement for being able to trust the various reports and machine learning models that are relying on the information that you curate. Rules based systems are useful for validating known requirements, but with the scale and complexity of data in modern organizations it is impractical, and often impossible, to manually create rules for all potential errors. The team at Anomalo are building a machine learning powered platform for identifying and alerting on anomalous and invalid changes in your data so that you aren’t flying blind. In this episode founders Elliot Shmukler and Jeremy Stanley explain how they have architected the system to work with your data warehouse and let you know about the critical issues hiding in your data without overwhelming you with alerts.
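\nAs a toy illustration of the general idea (not Anomalo's implementation), even a simple statistical check on daily row counts can catch the kind of silent failure that a hand-written rule would miss:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
counts = pd.Series(rng.poisson(10_000, size=60).astype(float))  # daily row counts
counts.iloc[-1] = 2_000                  # simulate a partially failed load today

baseline_mean = counts.rolling(30).mean().shift(1)  # trailing 30-day baseline
baseline_std = counts.rolling(30).std().shift(1)
z_score = (counts - baseline_mean) / baseline_std

if z_score.iloc[-1] < -3:
    print("alert: today's row count is anomalously low")
```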
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nApplications of data have grown well beyond the venerable business intelligence dashboards that organizations have relied on for decades. Now it is being used to power consumer facing services, influence organizational behaviors, and build sophisticated machine learning systems. Given this increased level of importance it has become necessary for everyone in the business to treat data as a product, in the same way that software came to be treated as a product in the early 2000s. In this episode Brian McMillan shares his work on the book "Building Data Products" and how he is working to educate business users and data professionals about the combination of technical, economical, and business considerations that need to be blended for these projects to succeed.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nReverse ETL is a product category that evolved from the landscape of customer data platforms with a number of companies offering their own implementation of it. While struggling with the work of automating data integration workflows with marketing, sales, and support tools Brian Leonard accidentally discovered this need himself and turned it into the open source framework Grouparoo. In this episode he explains why he decided to turn these efforts into an open core business, how the platform is implemented, and the benefits of having an open source contender in the landscape of operational analytics products.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData observability is a set of technical and organizational capabilities related to understanding how your data is being processed and used so that you can proactively identify and fix errors in your workflows. In this episode Metaplane founder Kevin Hu shares his working definition of the term and explains the work that he and his team are doing to cut down on the time to adoption for this new set of practices. He discusses the factors that influenced his decision to start with the data warehouse, the potential shortcomings of that approach, and where he plans to go from there. This is a great exploration of what it means to treat your data platform as a living system and apply state of the art engineering to it.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nCommunication and shared context are the hardest part of any data system. In recent years the focus has been on data catalogs as the means for documenting data assets, but those introduce a secondary system of record in order to find the necessary information. In this episode Emily Riederer shares her work to create a controlled vocabulary for managing the semantic elements of the data managed by her team and encoding it in the schema definitions in her data warehouse. She also explains how she created the dbtplyr package to simplify the work of creating and enforcing your own controlled vocabularies.
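\nThe dbtplyr package itself is an R/dbt tool, but the underlying idea is easy to sketch: a validator that checks warehouse column names against an agreed vocabulary of measure prefixes (the prefixes below are hypothetical examples, not the actual vocabulary from the episode):

```python
# map of approved name stubs to their meaning
VOCABULARY = {
    "id": "identifier",
    "ind": "binary indicator",
    "n": "count",
    "amt": "monetary amount",
    "dt": "date",
}

def invalid_columns(columns: list[str]) -> list[str]:
    """Return columns whose leading stub is not in the controlled vocabulary."""
    return [c for c in columns if c.split("_", 1)[0] not in VOCABULARY]

print(invalid_columns(["id_user", "n_logins", "revenue"]))  # ['revenue']
```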
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThis has been an active year for the data ecosystem, with a number of new product categories and substantial growth in existing areas. In an attempt to capture the zeitgeist Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy join the show to reflect on the past year and share their thoughts on the year to come.
\nIntroduction
\nHow did you get involved in the area of data management?
\nWhat were the main themes that you saw data practitioners and vendors focused on this year?
\nWhat is the major bottleneck for data teams in 2021? Will it be the same in 2022?\nOne way to reason about progress in any domain is to look at what the primary bottleneck to further progress (here, data adoption for decision making) was at different points in time. In the data domain we have seen a number of bottlenecks. For example, scaling data platforms was first answered by Hadoop and on-prem columnar stores, and then by cloud data warehouses such as Snowflake & BigQuery. Then the problem was data integration and transformation, which was addressed by data integration vendors and frameworks such as Fivetran and Airbyte, modern orchestration and transformation frameworks such as Dagster & dbt, and “reverse-ETL” tools such as Hightouch. What is the main challenge now?
\nWill SQL be challenged as a primary interface to analytical data?\nOver the past year we have seen the launch of several post-SQL languages, such as Malloy and Preql, as well as metric layer query languages from Transform and Supergrain.
\nTo what extent does speed matter?\nOver the past couple of months, we’ve seen the resurgence of “benchmark wars” between major data warehousing platforms. To what extent do speed benchmarks inform decisions for modern data teams? How important is query speed in a modern data workflow? What needs to be true about your current DWH solution and potential alternatives to make a move?
\nHow has the way data teams work been changing?\nIn 2020 remote seemed like a temporary emergency state. In 2021, it went mainstream. How has that affected the day-to-day of data teams, how they collaborate internally and with stakeholders?
\nWhat’s it like to be a data vendor in 2021?
\nVertically integrated vs. modular data stack?\nThere are multiple forces in play. Will the stack continue to be fragmented? Will we see major consolidation? If so, in which parts of the stack?
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData Engineering is still a relatively new field that is going through a continued evolution as new technologies are introduced and new requirements are understood. In this episode Maxime Beauchemin returns to revisit what it means to be a data engineer and how the role has changed over the past 5 years.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe data mesh is a thesis that was presented to address the technical and organizational challenges that businesses face in managing their analytical workflows at scale. Zhamak Dehghani introduced the concepts behind this architectural pattern in 2019, and since then it has been gaining popularity with many companies adopting some version of it in their systems. In this episode Zhamak re-joins the show to discuss the real world benefits that have been seen, the lessons that she has learned while working with her clients and the community, and her vision for the future of the data mesh.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nOne of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. In this episode Artyom Keydunov and Pavel Tiunov share their work on Cube.js and the various ways that it is being used in the open source community.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding a well managed data ecosystem for your organization requires a holistic view of all of the producers, consumers, and processors of information. The team at Metaphor are building a fully connected metadata layer to provide both technical and social intelligence about your data. In this episode Pardhu Gunnam and Mars Lan explain how they have designed the architecture and user experience to allow everyone to collaborate on the data lifecycle and provide opportunities for automation and extensible workflows.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nSpark is a powerful and battle tested framework for building highly scalable data pipelines. Because of its proven ability to handle large volumes of data Capital One has invested in it for their business needs. In this episode Gokul Prabagaren shares how he uses it to calculate rewards points, including the auditing requirements involved and how he designed his pipeline to maintain all of the necessary information through a pattern of data enrichment.
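\nAs a hedged sketch of the enrichment pattern described here (this is illustrative PySpark, not Capital One's actual pipeline; paths and columns are hypothetical), the transactions are enriched once, up front, so every stored record carries its own audit context:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rewards-enrichment").getOrCreate()

txns = spark.read.parquet("/data/transactions")      # hypothetical inputs
merchants = spark.read.parquet("/data/merchants")

enriched = (
    txns.join(merchants, "merchant_id", "left")      # attach merchant context
        .withColumn("points", F.col("amount") * F.col("category_multiplier"))
        .withColumn("enriched_at", F.current_timestamp())  # audit column
)
enriched.write.mode("append").parquet("/data/rewards_ledger")
```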
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe core of providing your users with excellent service is understanding them so that you can provide a personalized experience. Unfortunately many sites and applications take that to the extreme and collect too much information. In order to make it easier for developers to build customer profiles in a way that respects their privacy Serge Huber helped to create the Apache Unomi framework as an open source customer data platform. In this episode he explains how it can be used to build rich and useful profiles of your users, the system architecture that powers it, and some of the ways that it is being integrated into an organization’s broader data ecosystem.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nHiring data professionals is challenging for a multitude of reasons, and as with every interview process there is a potential for bias to creep in. Tim Freestone founded Alooba to provide a more stable reference point for evaluating candidates to ensure that you can make more informed comparisons based on their actual knowledge. In this episode he explains how Alooba got started, how it is being used in the interview process for data oriented roles, and how it can also provide visibility into your organization’s overall data literacy. The whole process of hiring is an important organizational skill to cultivate and this is an interesting exploration of the specific challenges involved in finding data professionals.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nA/B testing and experimentation are the most reliable way to determine whether a change to your product will have the desired effect on your business. Unfortunately, being able to design, deploy, and validate experiments is a complex process that requires a mix of technical capacity and organizational involvement which is hard to come by. Chetan Sharma founded Eppo to provide a system that organizations of every scale can use to reduce the burden of managing experiments so that you can focus on improving your business. In this episode he digs into the technical, statistical, and design requirements for running effective experiments and how he has architected the Eppo platform to make the process more accessible to business and data professionals.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe modern data stack has been gaining a lot of attention recently with a rapidly growing set of managed services for different stages of the data lifecycle. With all of the available options it is possible to run a scalable, production grade data platform with a small team, but there are still sharp edges and integration challenges to work through. Peter Fishman and Dan Silberman experienced these difficulties firsthand and created Mozart Data to provide a single, easy to use option for getting started with the modern data stack. In this episode they explain how they designed a user experience to make working with data more accessible to organizations without a data team, while allowing for more advanced users to build out more complex workflows. They also share their thoughts on the modern data ecosystem and how it improves the availability of analytics for companies of all sizes.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe data that you have access to affects the questions that you can answer. By using external data sources you can drastically increase the range of analysis that is available to your organization. The challenge comes in all of the operational aspects of finding, accessing, organizing, and serving that data. In this episode Mark Hookey discusses how he and his team at Demyst do all of the DataOps for external data sources so that you don’t have to, including the systems necessary to organize and catalog the various collections that they host, the various serving layers to provide query interfaces that match your platform, and the utility of having a single place to access a multitude of information. If you are having trouble answering questions for your business with the data that you generate and collect internally, then it is definitely worthwhile to explore the information available from external sources.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe technology for scaling storage and processing of data has gone through massive evolution over the past decade, leaving us with the ability to work with massive datasets at the cost of massive complexity. Nick Schrock created the Dagster framework to help tame that complexity and scale the organizational capacity for working with data. In this episode he shares the journey that he and his team at Elementl have taken to understand the state of the ecosystem and how they can provide a foundational layer for a holistic data platform.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nOne of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. With the improvements in streaming engines it is now possible to perform all of your data integration in near real time, but it can be challenging to understand the proper processing patterns to make that performant. In this episode Ori Rafael shares his experiences from Upsolver and building scalable stream processing for integrating and analyzing data, and what the tradeoffs are when coming from a batch oriented mindset.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe most important gauge of success for a data platform is the level of trust in the accuracy of the information that it provides. In order to build and maintain that trust it is necessary to invest in defining, monitoring, and enforcing data quality metrics. In this episode Michael Harper advocates for proactive data quality and starting with the source, rather than being reactive and having to work backwards from when a problem is found.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nA significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. After experiencing the impacts of fragmented metadata and previous attempts at building a solution Suresh Srinivas and Sriharsha Chintalapani created the OpenMetadata project. In this episode they share the lessons that they have learned through their previous attempts and the positive impact that a unified metadata layer had during their time at Uber. They also explain how the OpenMetadata project is aiming to be a common standard for defining and storing metadata for every use case in data platforms and the ways that they are architecting the reference implementation to simplify its adoption. This is an ambitious and exciting project, so listen and try it out today.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBusiness intelligence is often equated with a collection of dashboards that show various charts and graphs representing data for an organization. What is overlooked in that characterization is the level of complexity and effort that are required to collect and present that information, and the opportunities for providing those insights in other contexts. In this episode Telmo Silva explains how he co-founded ClicData to bring full featured business intelligence and reporting to every organization without having to build and maintain that capability on their own. This is a great conversation about the technical and organizational operations involved in building a comprehensive business intelligence system and the current state of the market.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe precursor to widespread adoption of cloud data warehouses was the creation of customer data platforms. Acting as a centralized repository of information about how your customers interact with your organization they drove a wave of analytics about how to improve products based on actual usage data. A natural outgrowth of that capability is the more recent growth of reverse ETL systems that use those analytics to feed back into the operational systems used to engage with the customer. In this episode Tejas Manohar and Rachel Bradley-Haas share the story of their own careers and experiences coinciding with these trends. They also discuss the current state of the market for these technological patterns and how to take advantage of them in your own work.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe perennial question of data warehousing is how to model the information that you are storing. This has given rise to methods as varied as star and snowflake schemas, data vault modeling, and wide tables. The challenge with many of those approaches is that they are optimized for answering known questions but brittle and cumbersome when exploring unknowns. In this episode Ahmed Elsamadisi shares his journey to find a more flexible and universal data model in the form of the "activity schema" that is powering the Narrator platform, and how it has allowed his customers to perform self-service exploration of their business domains without being blocked by schema evolution in the data warehouse. This is a fascinating exploration of what can be done when you challenge your assumptions about what is possible.
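\nA toy rendering of the concept (illustrative only; the column names are hypothetical): instead of many wide, question-specific tables, every business event lands in one narrow stream of activities.

```python
import pandas as pd

activities = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "activity": ["signed_up", "placed_order", "signed_up"],
    "ts": pd.to_datetime(["2021-01-02", "2021-01-05", "2021-01-03"]),
    "feature_json": ["{}", '{"total": 42}', "{}"],
})

# questions like "first order per customer" become temporal lookups on one
# table rather than schema changes in the warehouse
first_order = (
    activities[activities.activity == "placed_order"]
    .sort_values("ts")
    .groupby("customer_id", as_index=False)
    .first()
)
```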
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nStreaming data systems have been growing more capable and flexible over the past few years. Despite this, it is still challenging to build reliable pipelines for stream processing. In this episode Eric Sammer discusses the shortcomings of the current set of streaming engines and how they force engineers to work at an extremely low level of abstraction. He also explains why he started Decodable to address that limitation and the work that he and his team have done to let data engineers build streaming pipelines entirely in SQL.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe market for business intelligence has been going through an evolutionary shift in recent years. One of the driving forces for that change has been the rise of analytics engineering powered by dbt. Lightdash has fully embraced that shift by building an entire open source business intelligence framework that is powered by dbt models. In this episode Oliver Laslett describes why dashboards aren’t sufficient for business analytics, how Lightdash builds on the modeling work that you are already doing in your data warehouse with dbt, and how they are focusing on bridging the divide between data teams and business teams and the requirements that each has for data workflows.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe focus of the past few years has been to consolidate all of the organization’s data into a cloud data warehouse. As a result there have been a number of trends in data that take advantage of the warehouse as a single focal point. Among those trends is the advent of operational analytics, which completes the cycle of data from collection, through analysis, to driving further action. In this episode Boris Jabes, CEO of Census, explains how the work of synchronizing cleaned and consolidated data about your customers back into the systems that you use to interact with those customers allows for a powerful feedback loop that has been missing in data systems until now. He also discusses how Census makes that synchronization easy to manage, how it fits with the growth of data quality tooling, and how you can start using it today.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. It was also designed to be able to work for small scale systems that are just starting to develop in complexity. In order to support the project and make it even easier to use for organizations of every size Shirshanka Das and Swaroop Jagadish founded Acryl Data. In this episode they discuss the recent work that has been done by the community, how their work is building on top of that foundation, and how you can get started with DataHub for your own work to manage data discovery today. They also share their ambitions for the near future of adding data observability and data quality management features.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nOrganizations of all sizes are striving to become data driven, starting in earnest with the rise of big data a decade ago. With the never-ending growth in data sources and methods for aggregating and analyzing them, the use of data to direct the business has become a requirement. Randy Bean has been helping enterprise organizations define and execute their data strategies since before the age of big data. In this episode he discusses his experiences and how he approached the work of distilling them for his book "Fail Fast, Learn Faster". This is an entertaining and enlightening exploration of the business side of data with an industry veteran.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe key to making data valuable to business users is the ability to calculate meaningful metrics and explore them along useful dimensions. Business intelligence tools have provided this capability for years, but they don’t offer a means of exposing those metrics to other systems. Metriql is an open source project that provides a headless BI system where you can define your metrics and share them with all of your other processes. In this episode Burak Kabakcı shares the story behind the project, how you can use it to create your metrics definitions, and the benefits of treating the semantic layer as a dedicated component of your platform.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nTransactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems this is necessary to ensure that a set of messages or transformations are all executed together across different queues. In this episode Denis Rystsov explains how he added support for transactions to the Redpanda streaming engine. He discusses the use cases for transactions, the different strategies, semantics, and guarantees that they might need to support, and how his implementation ended up improving the performance of bulk write operations. This is an interesting deep dive into the internals of a high performance streaming engine and the details that are involved in building distributed systems.
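\nBecause Redpanda speaks the Kafka protocol, the client-facing shape of those transactions can be sketched with a standard Kafka client (broker address, topics, and payloads here are hypothetical):

```python
from confluent_kafka import Producer

p = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "order-writer-1",  # identifies this producer across restarts
})
p.init_transactions()

p.begin_transaction()
try:
    p.produce("orders", key=b"o-1", value=b'{"total": 42}')
    p.produce("order-events", key=b"o-1", value=b'{"state": "created"}')
    p.commit_transaction()   # both messages become visible atomically
except Exception:
    p.abort_transaction()    # neither message is exposed to consumers
```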
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAerospike is a database engine that is designed to provide millisecond response times for queries across terabytes or petabytes. In this episode Chief Strategy Officer, Lenley Hensarling, explains how the ability to process these large volumes of information in real-time allows businesses to unlock entirely new capabilities. He also discusses the technical implementation that allows for such extreme performance and how the data model contributes to the scalability of the system. If you need to deal with massive data, at high velocities, in milliseconds, then Aerospike is definitely worth learning about.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe promise of online services is that they will make your life easier in exchange for collecting data about you. The reality is that they use more information than you realize for purposes that are not what you intended. There have been many attempts to harness all of the data that you generate for gaining useful insights about yourself, but they are generally difficult to set up and manage or require software development experience. The team at Prifina have built a platform that allows users to create their own personal data cloud and install applications built by developers that power useful experiences while keeping you in full control. In this episode Markus Lampinen shares the goals and vision of the company, the technical aspects of making it a reality, and the future vision for how services can be designed to respect users’ privacy while still providing compelling experiences.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe accuracy and availability of data has become critically important to the day-to-day operation of businesses. Similar to the practice of site reliability engineering as a means of ensuring consistent uptime of web services, there has been a new trend of building data reliability engineering practices in companies that rely heavily on their data. In this episode Egor Gryaznov explains how this practice manifests from a technical and organizational perspective and how you can start adopting it in your own teams.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nPython has become the de facto language for working with data. That has brought with it a number of challenges having to do with the speed and scalability of working with large volumes of information. There have been many projects and strategies for overcoming these challenges, each with their own set of tradeoffs. In this episode Ehsan Totoni explains how he built the Bodo project to bring the speed and processing power of HPC techniques to the Python data ecosystem without requiring any re-work.
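\nThe "without requiring any re-work" claim is the interesting part: assuming Bodo's numba-style bodo.jit decorator, existing pandas logic is compiled for parallel execution rather than rewritten (the file and column names are hypothetical):

```python
import bodo
import pandas as pd

@bodo.jit  # compiles the function for parallel, MPI-based execution
def daily_totals(path):
    df = pd.read_parquet(path)
    return df.groupby("day")["amount"].sum()

# totals = daily_totals("events.parquet")  # same code, scaled across cores/nodes
```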
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding, scaling, and maintaining the operational components of a machine learning workflow are all hard problems. Add the work of creating the model itself, and it’s not surprising that a majority of companies that could greatly benefit from machine learning have yet to either put it into production or see the value. Tristan Zajonc recognized the complexity that acts as a barrier to adoption and created the Continual platform in response. In this episode he shares his perspective on the benefits of declarative machine learning workflows as a means of accelerating adoption in businesses that don’t have the time, money, or ambition to build everything from scratch. He also discusses the technical underpinnings of what he is building and how using the data warehouse as a shared resource drastically shortens the time required to see value. This is a fascinating episode and Tristan’s work at Continual is likely to be the catalyst for a new stage in the machine learning community.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBiology has been gaining a lot of attention in recent years, even before the pandemic. As an outgrowth of that popularity, a new field has grown up that pairs statistics and computational analysis with scientific research, namely bioinformatics. This brings with it a unique set of challenges for data collection, data management, and analytical capabilities. In this episode Jillian Rowe shares her experience of working in the field and supporting teams of scientists and analysts with the data infrastructure that they need to get their work done. This is a fascinating exploration of the collaboration between data professionals and scientists.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe Cassandra database is one of the first open source options for globally scalable storage systems. Since its introduction in 2008 it has been powering systems at every scale. The community recently released a new major version that marks a milestone in its maturity and stability as a project and database. In this episode Ben Bromhead, CTO of Instaclustr, shares the challenges that the community has worked through, the work that went into the release, and how the stability and testing improvements are setting the stage for the future of the project.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nGartner analysts are tasked with identifying promising companies each year that are making an impact in their respective categories. For businesses that are working in the data management and analytics space they recognized the efforts of Timbr.ai, Soda Data, Nexla, and Tada. In this episode the founders and leaders of each of these organizations share their perspective on the current state of the market, and the challenges facing businesses and data professionals today.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe term "data platform" gets thrown around a lot, but have you stopped to think about what it actually means for you and your organization? In this episode Lior Gavish, Lior Solomon, and Atul Gupte share their view of what it means to have a data platform, discuss their experiences building them at various companies, and provide advice on how to treat them like a software product. This is a valuable conversation about how to approach the work of selecting the tools that you use to power your data systems and considerations for how they can be woven together for a unified experience across your various stakeholders.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe Presto project has become the de facto option for building scalable open source analytics in SQL for the data lake. In recent months the community has focused their efforts on making it the fastest possible option for running your analytics in the cloud. In this episode Dipti Borkar discusses the work that she and her team are doing at Ahana to simplify the work of running your own PrestoDB environment in the cloud. She explains how they are optimizing the runtime to reduce latency and increase query throughput, the ways that they are contributing back to the open source community, and the exciting improvements that are in the works to make Presto an even more powerful option for all of your analytics.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe reason that so much time and energy is spent on data integration is because of how our applications are designed. By making the software the owner of the data that it generates, we have to go through the trouble of extracting the information for it to be used elsewhere. The team at Cinchy are working to bring about a new paradigm of software architecture that puts the data as the central element. In this episode Dan DeMers, Cinchy’s CEO, explains how their concept of a "Dataware" platform eliminates the need for costly and error prone integration processes and the benefits that it can provide for transactional and analytical application design. This is a fascinating and unconventional approach to working with data, so definitely give this a listen to expand your thinking about how to build your systems.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe technological and social ecosystem of data engineering and data management has been reaching a stage of maturity recently. As part of this stage in our collective journey the focus has been shifting toward operation and automation of the infrastructure and workflows that power our analytical workloads. It is an encouraging sign for the industry, but it is still a complex and challenging undertaking. In order to make this world of DataOps more accessible and manageable the team at Nexla has built a platform that decouples the logical unit of data from the underlying mechanisms so that you can focus on the problems that really matter to your business. In this episode Saket Saurabh (CEO) and Avinash Shahdadpuri (CTO) share the story behind the Nexla platform, discuss the technical underpinnings, and describe how their concept of a Nexset simplifies the work of building data products for sharing within and between organizations.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData lakes have been gaining popularity alongside an increase in their sophistication and usability. Despite improvements in performance and data architecture they still require significant knowledge and experience to deploy and manage. In this episode Vikrant Dubey discusses his work on the Cuelake project which allows data analysts to build a lakehouse with SQL queries. By building on top of Zeppelin, Spark, and Iceberg he and his team at Cuebook have built an autoscaled cloud native system that abstracts the underlying complexity.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nA major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? Compilerworks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of Compilerworks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform and the system that they have built to make it a manageable task.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe vast majority of data tools and platforms that you hear about are designed for working with structured, text-based data. What do you do when you need to manage unstructured information, or build a computer vision model? Activeloop was created for exactly that purpose. In this episode Davit Buniatyan, founder and CEO of Activeloop, explains why he is spending his time and energy on building a platform to simplify the work of getting your unstructured data ready for machine learning. He discusses the inefficiencies that teams run into from having to reprocess data multiple times, his work on the open source Hub library to solve this problem for everyone, and his thoughts on the vast potential that exists for using computer vision to solve hard and meaningful problems.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAll of the fancy data platform tools and shiny dashboards that you use are pointless if the consumers of your analysis don’t have trust in the answers. Stemma helps you establish and maintain that trust by giving visibility into who is using what data, annotating the reports with useful context, and understanding who is responsible for keeping it up to date. In this episode Mark Grover explains what he is building at Stemma, how it expands on the success of the Amundsen project, and why trust is the most important asset for data teams.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nEvery organization needs to be able to use data to answer questions about their business. The trouble is that the data is usually spread across a wide and shifting array of systems, from databases to dashboards. The other challenge is that even if you do find the information you are seeking, there might not be enough context available to determine how to use it or what it means. Castor is building a data discovery platform aimed at solving this problem, allowing you to search for and document details about everything from a database column to a business intelligence dashboard. In this episode CTO Amaury Dumoulin shares his perspective on the complexity of letting everyone in the company find answers to their questions and how Castor is designed to help.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data there has been a struggle to merge fast, incremental updates with large, historical analysis. Vinoth Chandar helped to create the Hudi project while at Uber to address this challenge. By adding support for small, incremental inserts into large table structures, and building support for arbitrary update and delete operations, the Hudi project brings the best of both worlds together. In this episode Vinoth shares the history of the project, how its architecture allows for building more frequently updated analytical queries, and the work being done to add a more polished experience to the data lake paradigm.
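\nAs a rough sketch of what Hudi's upsert model looks like in practice (the table name, path, and fields are invented for illustration, and this assumes the Hudi Spark bundle is on the classpath):

```python
# Illustrative sketch: upserting a small batch of records into a Hudi table
# with PySpark, using Hudi's publicly documented Spark datasource options.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-demo").getOrCreate()

updates = spark.createDataFrame(
    [("ride-001", "2021-07-01T12:00:00", 9.50)],
    ["ride_id", "ts", "fare"],
)

hudi_options = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",  # merge into the existing table
}

(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://lake/rides"))
```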
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nCompanies of all sizes and industries are trying to use the data that they and their customers generate to survive and thrive in the modern economy. As a result, they are relying on a constantly growing number of data sources being accessed by an increasingly varied set of users. In order to help data consumers find and understand the data that is available, and help the data producers understand how to prioritize their work, SelectStar has built a data discovery platform that brings everyone together. In this episode Shinji Kim shares her experience as a data professional struggling to collaborate with her colleagues and how that led her to founding a company to address that problem. She also discusses the combination of technical and social challenges that need to be solved for everyone to gain context and comprehension around their most valuable asset.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nEveryone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have still been high due to the level of technical understanding and operational capacity that have been required to run at scale. Datastax has recently introduced a new managed offering for Pulsar workloads in the form of Astra Streaming that lowers those barriers and makes streaming workloads accessible to a wider audience. In this episode Prabhat Jha and Jonathan Ellis share the work that they have been doing to integrate streaming data into their managed Cassandra service. They explain how Pulsar is being used by their customers, the work that they have done to scale the administrative workload for multi-tenant environments, and the challenges of operating such a data intensive service at large scale. This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads.
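\nFor a sense of how simple the client side of Pulsar is once the operational burden is handled for you, here is a minimal produce/consume loop with the Apache Pulsar Python client (the service URL and topic name are placeholders):

```python
# Illustrative only: publish and consume one message with the Pulsar client.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Producer side: send a single event to a topic
producer = client.create_producer("persistent://public/default/events")
producer.send("ride_completed".encode("utf-8"))

# Consumer side: subscribe, receive one message, and acknowledge it
consumer = client.subscribe("persistent://public/default/events", "demo-subscription")
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)

client.close()
```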
\nIntroduction
\nHow did you get involved in the area of data management?
\nCan you describe what the Astra platform is and the story behind it?
\nHow does streaming fit into your overall product vision and the needs of your customers?
\nWhat was your selection process/criteria for adopting a streaming engine to complement your existing technology investment?
\nWhat are the core use cases that you are aiming to support with Astra Streaming?
\nCan you describe the architecture and automation of your hosted platform for Pulsar?
\nWhat are some of the additional tools that you have added to your distribution of Pulsar to simplify operation and use?
\nWhat are some of the sharp edges that you have had to sand down as you have scaled up your usage of Pulsar?
\nWhat is the process for someone to adopt and integrate with your Astra Streaming service?
\nOne of the capabilities that you highlight on the product page for Astra Streaming is the ability to execute machine learning workflows on data in flight. What are some of the supporting systems that are necessary to power that workflow?
\nWhat are the ways that you are engaging with and supporting the Pulsar community?
\nWhat are the most interesting, innovative, or unexpected ways that you have seen Astra used?
\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Astra?
\nWhen is Astra the wrong choice?
\nWhat do you have planned for the future of Astra?
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nCollecting and cleaning data is only useful if someone can make sense of it afterward. The latest evolution in the data ecosystem is the introduction of a dedicated metrics layer to help address the challenge of adding context and semantics to raw information. In this episode Nick Handel shares the story behind Transform, a new platform that provides a managed metrics layer for your data platform. He explains the challenges that occur when metrics are maintained across a variety of systems, the benefits of unifying them in a common access layer, and the potential that it unlocks for everyone in the business to confidently answer questions with data.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.
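\nTo make the idea of proactive checks concrete, here is a toy sketch of diffing a staging table against production before promoting a change (the tables and columns are invented; real tools compare row-level diffs and many more statistics):

```python
# Toy illustration of "diffing" a staging table against production as a
# CI gate, in the spirit of proactive data quality checks.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_prod (id INTEGER, amount REAL);
    CREATE TABLE orders_staging (id INTEGER, amount REAL);
    INSERT INTO orders_prod VALUES (1, 10.0), (2, 15.5);
    INSERT INTO orders_staging VALUES (1, 10.0), (2, 99.9);
""")

def profile(table):
    # Cheap summary stats; a real tool would also diff individual rows.
    rows, amount_sum = conn.execute(
        f"SELECT COUNT(*), SUM(amount) FROM {table}"
    ).fetchone()
    return {"rows": rows, "amount_sum": amount_sum}

prod, staging = profile("orders_prod"), profile("orders_staging")
if prod != staging:
    print("Change alters data:", prod, "->", staging)  # fail the CI check here
```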
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nSpecial Guest: Gleb Mezhanskiy.
","summary":"An interview with Gleb Mezhanskiy about his work at Datafold and how it has informed his strategies for proactive management of data quality across your organization.","date_published":"2021-07-19T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7368e374-4299-43d4-a576-4999f38ed668.mp3","mime_type":"audio/mpeg","size_in_bytes":47762782,"duration_in_seconds":3666}]},{"id":"podlove-2021-07-16t11:34:11+00:00-c32c7341f40a9a5","title":"Low Code And High Quality Data Engineering For The Whole Organization With Prophecy","url":"https://www.dataengineeringpodcast.com/prophecy-low-code-data-engineering-episode-204","content_text":"Summary\nThere is a wealth of tools and systems available for processing data, but the user experience of integrating them and building workflows is still lacking. This is particularly important in large and complex organizations where domain knowledge and context is paramount and there may not be access to engineers for codifying that expertise. Raj Bains founded Prophecy to address this need by creating a UI first platform for building and executing data engineering workflows that orchestrates Airflow and Spark. Rather than locking your business logic into a proprietary storage layer and only exposing it through a drag-and-drop editor Prophecy synchronizes all of your jobs with source control, allowing an easy bi-directional interaction between code first and no-code experiences. In this episode he shares his motivations for creating Prophecy, how he is leveraging the magic of compilers to translate between UI and code oriented representations of logic, and the organizational benefits of having a cohesive experience designed to bring business users and domain experts into the same platform as data engineers and analysts.\nAnnouncements\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nYou listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!\nAre you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. 
Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.\nAtlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription\nYour host is Tobias Macey and today I’m interviewing Raj Bains about Prophecy, a low-code data engineering platform built on Spark and Airflow\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what you are building at Prophecy and the story behind it?\nThere are a huge number of tools and recommended architectures for every variety of data need. Why is data engineering still such a complicated and challenging undertaking?\n\nWhat features and capabilities does Prophecy provide to help address those issues?\n\n\nWhat are the roles and use cases that you are focusing on serving with Prophecy?\nWhat are the elements of the data platform that Prophecy can replace?\nCan you describe how Prophecy is implemented?\n\nWhat was your selection criteria for the foundational elements of the platform?\nWhat would be involved in adopting other execution and orchestration engines?\n\n\nCan you describe the workflow of building a pipeline with Prophecy?\n\nWhat are the design and structural features that you have built to manage workflows as they scale in terms of technical and organizational complexity?\nWhat are the options for data engineers/data professionals to build and share reusable components across the organization?\n\n\nWhat are the most interesting, innovative, or unexpected ways that you have seen Prophecy used?\nWhat are the most interesting, unexpected, or challenging lessons that you have learned while working on Prophecy?\nWhen is Prophecy the wrong choice?\nWhat do you have planned for the future of Prophecy?\n\nContact Info\n\nLinkedIn\n@_raj_bains on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nClosing Announcements\n\nThank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.\nVisit the site to subscribe to the show, sign up for the mailing list, and read the show notes.\nIf you’ve learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com) with your story.\nTo help other people find the show please leave a review on iTunes and tell your friends and co-workers\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\n\nLinks\n\nProphecy\nCUDA\nApache Hive\nHortonworks\nNoSQL\nNewSQL\nPaxos\nApache Impala\nAbInitio\nTeradata\nSnowflake\n\nPodcast Episode\n\n\nPresto\n\nPodcast Episode\n\n\nLinkedIn\nSpark\nDatabricks\nCron\nAirflow\nAstronomer\nAlteryx\nStreamsets\nAzure Data Factory\nApache Flink\n\nPodcast Episode\n\n\nPrefect\n\nPodcast Episode\n\n\nDagster\n\nPodcast Episode\nPodcast.__init__ Episode\n\n\nKubernetes Operator\nScala\nKafka\nAbstract Syntax Tree\nLanguage Server Protocol\nAmazon Deequ\ndbt\nTecton\n\nPodcast Episode\n\n\nInformatica\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"There is a wealth of tools and systems available for processing data, but the user experience of integrating them and building workflows is still lacking. This is particularly important in large and complex organizations where domain knowledge and context is paramount and there may not be access to engineers for codifying that expertise. Raj Bains founded Prophecy to address this need by creating a UI first platform for building and executing data engineering workflows that orchestrates Airflow and Spark. Rather than locking your business logic into a proprietary storage layer and only exposing it through a drag-and-drop editor Prophecy synchronizes all of your jobs with source control, allowing an easy bi-directional interaction between code first and no-code experiences. In this episode he shares his motivations for creating Prophecy, how he is leveraging the magic of compilers to translate between UI and code oriented representations of logic, and the organizational benefits of having a cohesive experience designed to bring business users and domain experts into the same platform as data engineers and analysts.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nWe have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and "best practices" to make that task manageable. With the growing popularity of cloud services a new pattern has emerged and been dubbed the "Modern Data Stack". In this episode members of the GoDataDriven team, Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan, explain the combinations of services that comprise this architecture, share their experiences working with clients to employ the stack, and the benefits of bringing engineers and business users together with data.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nEvery data project, whether it’s analytics, machine learning, or AI, starts with the work of data cleaning. This is a critical step and benefits from being accessible to the domain experts. Trifacta is a platform for managing your data engineering workflow to make curating, cleaning, and preparing your information more approachable for everyone in the business. In this episode CEO Adam Wilson shares the story behind the business, discusses the myriad ways that data wrangling is performed across the business, and how the platform is architected to adapt to the ever-changing landscape of data management tools. This is a great conversation about how deliberate user experience and platform design can make a drastic difference in the amount of value that a business can provide to their customers.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAt the core of every data pipeline is a workflow manager (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy so it is important to pick something that provides the power and flexibility that you need. SaaSGlue is a managed service that lets you connect all of your systems, across clouds and physical infrastructure, and spanning all of your programming languages. In this episode Bart and Rich Wood explain how SaaSGlue is architected to allow for a high degree of flexibility in usage and deployment, their experience building a business with family, and how you can get started using it today. This is a fascinating platform with an endless set of use cases and a great team of people behind it.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData integration in the form of extract and load is the critical first step of every data project. There are a large number of commercial and open source projects that offer that capability but it is still far from being a solved problem. One of the most promising community efforts is that of the Singer ecosystem, but it has been plagued by inconsistent quality and design of plugins. In this episode the members of the Meltano project share the work they are doing to improve the discovery, quality, and capabilities of Singer taps and targets. They explain their work on the Meltano Hub and the Singer SDK and their long term goals for the Singer community.
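\nThe Singer specification itself is small: taps emit JSON messages on stdout, which targets read on stdin. A minimal illustration (the stream and fields are invented):

```python
# A minimal tap in the shape of the Singer spec: emit a SCHEMA message
# describing a stream, then RECORD messages carrying the data.
import json
import sys

def emit(message):
    json.dump(message, sys.stdout)
    sys.stdout.write("\n")

emit({
    "type": "SCHEMA",
    "stream": "users",
    "schema": {"properties": {"id": {"type": "integer"},
                              "email": {"type": "string"}}},
    "key_properties": ["id"],
})
emit({"type": "RECORD", "stream": "users",
      "record": {"id": 1, "email": "jane@example.com"}})
```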
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nWhile the overall concept of timeseries data is uniform, its usage and applications are far from it. One of the most demanding applications of timeseries data is for application and server monitoring due to the problem of high cardinality. In his quest to build a generalized platform for managing timeseries Paul Dix keeps getting pulled back into the monitoring arena. In this episode he shares the history of the InfluxDB project, the business that he has helped to build around it, and the architectural aspects of the engine that allow for its flexibility in managing various forms of timeseries data. This is a fascinating exploration of the technical and organizational evolution of the Influx Data platform, with some promising glimpses of where they are headed in the near future.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData Engineering is a broad and constantly evolving topic, which makes it difficult to teach in a concise and effective manner. Despite that, Daniel Molnar and Peter Fabian started the Pipeline Academy to do exactly that. In this episode they reflect on the lessons that they learned while teaching the first cohort of their bootcamp how to be effective data engineers. By focusing on the fundamentals, and making everyone write code, they were able to build confidence and impart the importance of context for their students.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe database is the core of any system because it holds the data that drives your entire experience. We spend countless hours designing the data model, updating engine versions, and tuning performance. But how confident are you that you have configured it to be as performant as possible, given the dozens of parameters and how they interact with each other? Andy Pavlo researches autonomous database systems, and out of that research he created OtterTune to find the optimal set of parameters to use for your specific workload. In this episode he explains how the system works, the challenge of scaling it to work across different database engines, and his hopes for the future of database systems.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nWorking with unstructured data has typically been a motivation for a data lake. The challenge is imposing enough order on the platform to make it useful. Kirk Marple has spent years working with data systems and the media industry, which inspired him to build a platform for automatically organizing your unstructured assets to make them more valuable. In this episode he shares the goals of the Unstruk Data Warehouse, how it is architected to extract asset metadata and build a searchable knowledge graph from the information, and the myriad ways that the system can be used. If you are wondering how to deal with all of the information that doesn’t fit in your databases or data warehouses, then this episode is for you.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nWhen you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? In this episode Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a way to speed up your experimentation, or an easy way to apply AutoML then this conversation is for you.
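\nThe underlying idea of scoring where the data lives can be shown with plain SQL arithmetic. This is a conceptual sketch with invented coefficients, not Vertica's actual in-database ML API:

```python
# Conceptual sketch of in-database scoring: apply a trained linear model as
# a SQL expression instead of exporting rows to an external process.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (sqft REAL, bedrooms REAL);
    INSERT INTO houses VALUES (1500, 3), (2200, 4);
""")

# Pretend these coefficients came from a model trained elsewhere.
intercept, w_sqft, w_bed = 50000.0, 120.0, 10000.0

for row in conn.execute(
    "SELECT sqft, bedrooms, ? + ? * sqft + ? * bedrooms AS predicted_price "
    "FROM houses",
    (intercept, w_sqft, w_bed),
):
    print(row)
```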
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nGoogle pioneered an impressive number of the architectural underpinnings of the broader big data ecosystem. Now they offer the technologies that they run internally to external users of their cloud platform. In this episode Lak Lakshmanan enumerates the variety of services that are available for building your various data processing and analytical systems. He shares some of the common patterns for building pipelines to power business intelligence dashboards, machine learning applications, and data warehouses. If you’ve ever been overwhelmed or confused by the array of services available in the Google Cloud Platform then this episode is for you.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe way to build maintainable software and systems is through composition of individual pieces. By making those pieces high quality and flexible they can be used in surprising ways that the original creators couldn’t have imagined. One such component that has gone above and beyond its originally envisioned use case is BookKeeper, a distributed storage system that is optimized for durability and speed. In this episode Matteo Merli shares the story behind the creation of BookKeeper, the various ways that it is being used today, and the architectural aspects that make it such a strong building block for projects such as Pulsar. He also shares some of the other interesting systems that have been built on top of it and an amusing war story of running it at scale in its early years.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nSQL is the most widely used language for working with data, and yet the tools available for writing and collaborating on it are still clunky and inefficient. Frustrated with the lack of a modern IDE and collaborative workflow for managing the SQL queries and analysis of their big data environments, the team at Pinterest created Querybook. In this episode Justin Mejorada-Pier and Charlie Gu share the story of how the initial prototype for a data catalog ended up as one of their most widely used interfaces to their analytical data. They also discuss the unique combination of features that it offers, how it is implemented, and the path to releasing it as open source. Querybook is an impressive and unique piece of technology that is well worth exploring, so listen and try it out today.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nEvery part of the business relies on data, yet only a small team has the context and expertise to build and maintain workflows and data pipelines to transform, clean, and integrate it. In order for the true value of your data to be realized without burning out your engineers you need a way for everyone to get access to the information they care about. To help make that a more tractable problem Blake Burch co-founded Shipyard. In this episode he explains the utility of a low code solution that lets non engineers create their own self-serve pipelines, how the Shipyard platform is designed to make that possible, and how it allows engineers to create reusable tasks to satisfy the specific needs of the business. This is an interesting conversation about how to make data more accessible and more useful by improving the user experience of the tools that we create.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe data warehouse has become the focal point of the modern data platform. With increased usage of data across businesses, and a diversity of locations and environments where data needs to be managed, the warehouse engine needs to be fast and easy to manage. Yellowbrick is a data warehouse platform that was built from the ground up for speed, and can work across clouds and all the way to the edge. In this episode CTO Mark Cusack explains how the engine is architected, the benefits that speed and predictable pricing has for the organization, and how you can simplify your platform by putting the warehouse close to the data, instead of the other way around.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nMachine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch Edo Liberty founded Pinecone to build a database that works natively with vectors. In this episode he explains how this technology will allow teams to accelerate the speed of innovation, how vectors make it possible to build more advanced search functionality, and how Pinecone is architected. This is an interesting conversation about how reconsidering the architecture of your systems can unlock impressive new capabilities.
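\nAt its core, the capability a vector database provides is nearest neighbor search over embeddings. A brute-force sketch of that operation (real systems use approximate indexes to avoid scanning every vector; the data here is random stand-ins):

```python
# Conceptual sketch of vector search: cosine similarity between a query
# embedding and a set of stored item embeddings.
import numpy as np

rng = np.random.default_rng(42)
index = rng.normal(size=(10_000, 128))          # stored item embeddings
index /= np.linalg.norm(index, axis=1, keepdims=True)

query = rng.normal(size=128)
query /= np.linalg.norm(query)

scores = index @ query                          # cosine similarity
top_k = np.argsort(scores)[-5:][::-1]           # five closest items
print(top_k, scores[top_k])
```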
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData governance is a phrase that means many different things to many different people. This is because it is actually a concept that encompasses the entire lifecycle of data, across all of the people in an organization who interact with it. Stijn Christiaens co-founded Collibra with the goal of addressing the wide variety of technological aspects that are necessary to realize such an important and expansive process. In this episode he shares his thoughts on the balance between human and technological processes that are necessary for a well-managed data governance strategy, how Collibra is designed to aid in that endeavor, and his experiences using the platform that Collibra builds to help run the company itself. This is an excellent conversation that spans the engineering and philosophical complexities of an important and ever-present aspect of working with data.
\nHello and welcome to the Data Engineering Podcast, the show about modern data management
\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
\nRudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
\nYour host is Tobias Macey and today I’m interviewing Stijn Christiaens about data governance in the enterprise and how Collibra applies the lessons learned from their customers to their own business
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information. In order to eliminate the wasted effort of building custom integrations every time you want to combine lineage information across systems Julien Le Dem introduced the OpenLineage specification. In this episode he explains his motivations for starting the effort, the far-reaching benefits that it can provide to the industry, and how you can start integrating it into your data platform today. This is an excellent conversation about how competing companies can still find mutual benefit in co-operating on open standards.
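\nTo give a flavor of the specification, here is a minimal run event following the general shape of the OpenLineage spec (the namespaces, job names, producer URL, and collector endpoint are placeholders):

```python
# A minimal OpenLineage-style run event: a job, a run id, and the input and
# output datasets it touched. Send this to your lineage collector over HTTP.
import json
import uuid
from datetime import datetime, timezone

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-pipeline",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "daily_orders"},
    "inputs": [{"namespace": "postgres://prod", "name": "public.orders"}],
    "outputs": [{"namespace": "warehouse", "name": "marts.orders_daily"}],
}
print(json.dumps(event, indent=2))
```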
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThere is a lot of attention on the database market and cloud data warehouses. While they provide a measure of convenience, they also require you to sacrifice a certain amount of control over your data. If you want to build a warehouse that gives you both control and flexibility then you might consider building on top of the venerable PostgreSQL project. In this episode Thomas Richter and Joshua Drake share their advice on how to build a production ready data warehouse with Postgres.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding an API for real-time data is a challenging project. Making it robust, scalable, and fast is a full time job. The team at Tinybird wants to make it easy to turn a continuous stream of data into a production ready API or data product. In this episode CEO Jorge Sancha explains how they have architected their system to handle high data throughput and fast response times, and why they have invested heavily in Clickhouse as the core of their platform. This is a great conversation about the challenges of building a maintainable business from a technical and product perspective.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nSpark is one of the most well-known frameworks for data processing, whether for batch or streaming, ETL or ML, and at any scale. Because of its popularity it has been deployed on every kind of platform you can think of. In this episode Jean-Yves Stephan shares the work that he is doing at Data Mechanics to make it sing on Kubernetes. He explains how operating in a cloud-native context simplifies some aspects of running the system while complicating others, how it simplifies the development and experimentation cycle, and how you can get a head start using their pre-built Spark container. This is a great conversation for understanding how new ways of operating systems can have broader impacts on how they are being used.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade, data professionals are orienting around the concept of DataOps. More than just a collection of tools, there are a number of organizational and conceptual changes that a proper DataOps approach depends on. In this episode Kevin Stumpf, CTO of Tecton, Maxime Beauchemin, CEO of Preset, and Lior Gavish, CTO of Monte Carlo, discuss the grand vision and present realities of DataOps. They explain how to think about your data systems in a holistic and maintainable fashion, the security challenges that threaten to derail your efforts, and the power of using metadata as the foundation of everything that you do. If you are wondering how to get control of your data platforms and bring all of your stakeholders onto the same page then this conversation is for you.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe reason for collecting, cleaning, and organizing data is to make it usable by the organization. One of the most common and widely used methods of access is through a business intelligence dashboard. Superset is an open source option that has been gaining popularity due to its flexibility and extensible feature set. In this episode Maxime Beauchemin discusses how data engineers can use Superset to provide self service access to data and deliver analytics. He digs into how it integrates with your data stack, how you can extend it to fit your use case, and why open source systems are a good choice for your business intelligence. If you haven’t already tried out Superset then this conversation is well worth your time. Give it a listen and then take it for a test drive today.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nMost of the time when you think about a data pipeline or ETL job what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy address data by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial strength solution for it, and how it actually works under the hood. After listening to this you’ll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process.
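\nA toy version of the problem gives a sense of why it is hard: free-form addresses need normalization before any matching can work. This sketch only hints at the shape of a real NLP and entity resolution system:

```python
# Toy flavor of address matching: normalize free-form strings, then use
# fuzzy matching to decide whether two records refer to the same place.
import re
from difflib import SequenceMatcher

ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}

def normalize(address: str) -> str:
    tokens = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def same_place(a: str, b: str, threshold: float = 0.9) -> bool:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(same_place("123 Main Street, NYC", "123 main st nyc"))  # True
```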
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\n"Business as usual" is changing, with more companies investing in data as a first class concern. As a result, the data team is growing and introducing more specialized roles. In this episode Josh Benamram, CEO and co-founder of Databand, describes the motivations for these emerging roles, how these positions affect the team dynamics, and the types of visibility that they need into the data platform to do their jobs effectively. He also talks about how his experience working with these teams informs his work at Databand. If you are wondering how to apply your talents and interests to working with data then this episode is a must listen.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nOne of the biggest obstacles to success in delivering data products is cross-team collaboration. Part of the problem is the difference in the information that each role requires to do their job and where they expect to find it. This introduces a barrier to communication that is difficult to overcome, particularly in teams that have not reached a significant level of maturity in their data journey. In this episode Prukalpa Sankar shares her experiences across multiple attempts at building a system that brings everyone onto the same page, ultimately bringing her to found Atlan. She explains how the design of the platform is informed by the needs of managing data projects for large and small teams across her previous roles, how it integrates with your existing systems, and how it works to keep everyone aligned.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData quality is on the top of everyone’s mind recently, but getting it right is as challenging as ever. One of the contributing factors is the number of people who are involved in the process and the potential impact on the business if something goes wrong. In this episode Maarten Masschelein and Tom Baeyens share the work they are doing at Soda to bring everyone on board to make your data clean and reliable. They explain how they started down the path of building a solution for managing data quality, their philosophy of how to empower data engineers with well engineered open source tools that integrate with the rest of the platform, and how to bring all of the stakeholders onto the same page to make your data great. There are many aspects of data quality management and it’s always a treat to learn from people who are dedicating their time and energy to solving it for everyone.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe world of business is becoming increasingly dependent on information that is accurate up to the minute. For analytical systems, the only way to provide this reliably is by implementing change data capture (CDC). Unfortunately, this is a non-trivial undertaking, particularly for teams that don’t have extensive experience working with streaming data and complex distributed systems. In this episode Raghu Murthy, founder and CEO of Datacoral, does a deep dive on how he and his team manage change data capture pipelines in production.
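\nFor a concrete sense of one common CDC building block, this sketch consumes the Postgres logical replication stream with psycopg2 (it assumes a database configured with wal_level=logical and a pre-created replication slot; the DSN and slot name are placeholders):

```python
# Consuming the Postgres logical replication stream: each message is one
# decoded change event (insert/update/delete) from the write-ahead log.
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

conn = psycopg2.connect(
    "dbname=app user=replicator",
    connection_factory=LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="cdc_demo", decode=True)

def handle(msg):
    print(msg.payload)                                   # one change event
    msg.cursor.send_feedback(flush_lsn=msg.data_start)   # acknowledge progress

cur.consume_stream(handle)                               # blocks, streaming changes
```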
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe team at DoorDash has a complex set of optimization challenges to deal with using data that they collect from a multi-sided marketplace. In order to handle the volume and variety of information that they use to run and improve the business the data team has to build a platform that analysts and data scientists can use in a self-service manner. In this episode the head of data platform for DoorDash, Sudhir Tonse, discusses the technologies that they are using, the approach that they take to adding new systems, and how they think about priorities for what to support for the whole company vs what to leave as a specialized concern for a single team. This is a valuable look at how to manage a large and growing data platform that supports a variety of teams with varied and evolving needs.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nA majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, managing schema evolution, and building transformations to match the expectations of the destination systems. H.O. Maycotte was faced with these same challenges but at a massive scale, leading him to question if there is a better way. After tasking some of his top engineers to consider the problem in a new light they created the Pilosa engine. In this episode H.O. explains how using Pilosa as the core he built the Molecula platform to eliminate the need to copy data between systems in order to make it accessible for analytical and machine learning purposes. He also discusses the challenges that he faces in helping potential users and customers understand the shift in thinking that this creates, and how the system is architected to make it possible. This is a fascinating conversation about what the future looks like when you revisit your assumptions about how systems are designed.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe process of building and deploying machine learning projects requires a staggering number of systems and stakeholders to work in concert. In this episode Yaron Haviv, co-founder of Iguazio, discusses the complexities inherent to the process, as well as how he has worked to democratize the technologies necessary to make machine learning operations maintainable.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrassingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy to use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project’s design and architecture, and explain their vision of what an open source data integration platform should offer. If you are struggling to maintain your extract and load pipelines or spending time on integrating with a new system when you would prefer to be working on other projects then this is definitely a conversation worth listening to.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nEvery business aims to be data driven, but not all of them succeed in that effort. In order to be able to truly derive insights from the data that an organization collects, there are certain foundational capabilities that they need to have in place. In order to help more businesses build those foundations, Tarush Aggarwal created 5xData, offering collaborative workshops to assist in setting up the technical and organizational systems that are necessary to succeed. In this episode he shares his thoughts on the core elements that are necessary for every business to be data driven, how he is helping companies incorporate those capabilities into their structure, and the ongoing support that he is providing through a network of mastermind groups. This is a great conversation about the initial steps that every group should be thinking of as they start down the road to making data informed decisions.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nWith all of the tools and services available for building a data platform it can be difficult to separate the signal from the noise. One of the best ways to get a true understanding of how a technology works in practice is to hear from people who are running it in production. In this episode Zeeshan Qureshi and Michelle Ark share their experiences using dbt to manage the data warehouse for Shopify. They explain how they structured the project to allow for multiple teams to collaborate in a scalable manner, the additional tooling that they added to address the edge cases that they have run into, and the optimizations that they baked into their continuous integration process to provide fast feedback and reduce costs. This is a great conversation about the lessons learned from real world use of a specific technology and how well it lives up to its promises.
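\nOne widely used dbt CI optimization of the kind discussed here is state-based selection, so that a pull request only builds the models that changed. This is a hedged sketch of the general pattern, not necessarily Shopify's exact setup; the paths and target name are examples:

```python
# Run only modified dbt models (and their children) in CI, deferring
# unchanged upstream refs to the artifacts from the last production run.
import subprocess

subprocess.run(
    [
        "dbt", "run",
        "--select", "state:modified+",   # changed models plus downstream
        "--defer",                       # read unchanged refs from prod
        "--state", "prod-artifacts/",    # manifest.json from the prod run
        "--target", "ci",
    ],
    check=True,
)
```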
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nCollecting and processing metrics for monitoring use cases is an interesting data problem. It is eminently possible to generate millions or billions of data points per second; the information needs to be propagated to a central location, processed, and analyzed in timeframes on the order of milliseconds or single-digit seconds; and the consumers of the data need to be able to query the information quickly and flexibly. As the systems that we build continue to grow in scale and complexity the need for reliable and manageable monitoring platforms increases proportionately. In this episode Rob Skillington, CTO of Chronosphere, shares his experiences building metrics systems that provide observability to companies that are operating at extreme scale. He describes how the M3DB storage engine is designed to manage the pressures of a critical system component, the inherent complexities of working with telemetry data, and the motivating factors that are contributing to the growing need for flexibility in querying the collected metrics. This is a fascinating conversation about an area of data management that is often taken for granted.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBusinesses often need to be able to ingest data from their customers in order to power the services that they provide. Each new source that they need to integrate with means another custom set of ETL tasks that they need to maintain. In order to reduce the friction involved in supporting new data transformations David Molot and Hassan Syyid built the Hotglue platform. In this episode they describe the data integration challenges facing many B2B companies, how their work on the Hotglue platform simplifies their efforts, and how they have designed the platform to make these ETL workloads embeddable and self service for end users.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe data warehouse has become the central component of the modern data stack. Building on this pattern, the team at Hightouch have created a platform that synchronizes information about your customers out to third party systems for use by marketing and sales teams. In this episode Tejas Manohar explains the benefits of sourcing customer data from one location for all of your organization to use, the technical challenges of synchronizing the data to external systems with varying APIs, and the workflow for enabling self-service access to your customer data by your marketing teams. This is an interesting conversation about the importance of the data warehouse and how it can be used beyond just internal analytics.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAs data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating on software and analysis, but collaborating on data is still an underserved capability. Gavin Mendel-Gleason encountered this problem first hand while working on the Sesshat databank, leading him to create TerminusDB and TerminusHub. In this episode he explains how the TerminusDB system is architected to provide a versioned graph storage engine that allows for branching and merging of data sets, and how that opens up new possibilities for individuals and teams to work together on building new data repositories. This is a fascinating conversation on the technical challenges involved, the opportunities that such a system provides, and the complexities inherent to building a successful business on open source.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAs more organizations are gaining experience with data management and incorporating analytics into their decision making, their next move is to adopt machine learning. In order to make those efforts sustainable, the core capability they need is for data scientists and analysts to be able to build and deploy features in a self service manner. As a result the feature store is becoming a required piece of the data platform. To fill that need Kevin Stumpf and the team at Tecton are building an enterprise feature store as a service. In this episode he explains how his experience building the Michelangelo platform at Uber has informed the design and architecture of Tecton, how it integrates with your existing data systems, and the elements that are required for a well engineered feature store.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nOne of the core responsibilities of data engineers is to manage the security of the information that they process. The team at Satori has a background in cybersecurity and they are using the lessons that they learned in that field to address the challenge of access control and auditing for data governance. In this episode co-founder and CTO Yoav Cohen explains how the Satori platform provides a proxy layer for your data, the challenges of managing security across disparate storage systems, and their approach to building a dynamic data catalog based on the records that your organization is actually using. This is an interesting conversation about the intersection of data and security and the lessons that can be learned in each direction.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData governance is a term that encompasses a wide range of responsibilities, both technical and process oriented. One of the more complex aspects is that of access control to the data assets that an organization is responsible for managing. The team at Immuta has built a platform that aims to tackle that problem in a flexible and maintainable fashion so that data teams can easily integrate authorization, data masking, and privacy enhancing technologies into their data infrastructure. In this episode Steve Touw and Stephen Bailey share what they have built at Immuta, how it is implemented, and how it streamlines the workflow for everyone involved in working with sensitive data. If you are starting down the path of implementing a data governance strategy then this episode will provide a great overview of what is involved.
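\nAs a conceptual illustration of policy-driven masking (the policies and roles here are invented, and platforms like Immuta enforce equivalent rules at query time rather than in application code):

```python
# Toy illustration of column-level masking: the same row is filtered
# through per-column policies before a non-privileged user sees it.
import hashlib

POLICIES = {"email": "hash", "ssn": "redact"}  # column -> masking rule

def mask(row: dict, role: str) -> dict:
    if role == "admin":                        # privileged roles see raw values
        return row
    out = {}
    for col, val in row.items():
        rule = POLICIES.get(col)
        if rule == "hash":
            out[col] = hashlib.sha256(val.encode()).hexdigest()[:12]
        elif rule == "redact":
            out[col] = "***"
        else:
            out[col] = val
    return out

print(mask({"name": "Jane", "email": "jane@example.com", "ssn": "123-45-6789"},
           role="analyst"))
```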
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAs a data engineer you’re familiar with the process of collecting data from databases, customer data platforms, APIs, etc. At YipitData they rely on a variety of alternative data sources to inform investment decisions by hedge funds and businesses. In this episode Andrew Gross, Bobby Muldoon, and Anup Segu describe the self service data platform that they have built to allow data analysts to own the end-to-end delivery of data projects and how that has allowed them to scale their output. They share the journey that they went through to build a scalable and maintainable system for web scraping, how to make it reliable and resilient to errors, and the lessons that they learned in the process. This was a great conversation about real world experiences in building a successful data-oriented business.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding data products is complicated by the fact that there are so many different stakeholders with competing goals and priorities. It is also challenging because of the number of roles and capabilities that are necessary to go from idea to delivery. Different organizations have tried a multitude of organizational strategies to improve the success rate of these data teams with varying levels of success. In this episode Jesse Anderson shares the lessons that he has learned while working with dozens of businesses across industries to determine the team structures and communication styles that have generated the best results. If you are struggling to deliver value from big data, or just starting down the path of building the organizational capacity to turn raw information into valuable products then this is a conversation that you don’t want to miss.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe first stage of every good pipeline is to perform data integration. With the increasing pace of change and the demand for up to date analytics, the need to integrate that data in near real time is growing. With the improvements and increased variety of options for streaming data engines and improved tools for change data capture, it is possible for data teams to make that goal a reality. However, despite all of the tools and managed distributions of those streaming engines it is still a challenge to build a robust and reliable pipeline for streaming data integration, especially if you need to expose those capabilities to non-engineers. In this episode Ido Friedman, CTO of Equalum, explains how they have built a no-code platform to make integration of streaming data and change data capture feeds easier to manage. He discusses the challenges that are inherent in the current state of CDC technologies, how they have architected their system to integrate well with existing data platforms, and how to build an appropriate level of abstraction for such a complex problem domain. If you are struggling with streaming data integration and change data capture then this interview is definitely worth a listen.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nOne of the oldest aphorisms about data is "garbage in, garbage out", which is why the current boom in data quality solutions is no surprise. With the growth in projects, platforms, and services that aim to help you establish and maintain control of the health and reliability of your data pipelines it can be overwhelming to stay up to date with how they all compare. In this episode Egor Gryaznov, CTO of Bigeye, joins the show to explore the landscape of data quality companies, the general strategies that they are using, and what problems they solve. He also shares how his own product is designed and the challenges that are involved in building a system to help data engineers manage the complexity of a data platform. If you are wondering how to get better control of your own pipelines and the traps to avoid then this episode is definitely worth a listen.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe core mission of data engineers is to provide the business with a way to ask and answer questions of their data. This often takes the form of business intelligence dashboards, machine learning models, or APIs on top of a cleaned and curated data set. Despite the rapid progression of impressive tools and products built to fulfill this mission, it is still an uphill battle to tie everything together into a cohesive and reliable platform. At Isima they decided to reimagine the entire ecosystem from the ground up and built a single unified platform to allow end-to-end self service workflows from data ingestion through to analysis. In this episode Darshan Rawal, CEO and co-founder of Isima, explains how the biOS platform is architected to enable ease of use, the challenges that were involved in building an entirely new system from scratch, and how it can integrate with the rest of your data platform to allow for incremental adoption. This was an interesting and contrarian take on the current state of the data management industry and is worth a listen to gain some additional perspective.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nA data catalog is a critical piece of infrastructure for any organization who wants to build analytics products, whether internal or external. While there are a number of platforms available for building that catalog, many of them are either difficult to deploy and integrate, or expensive to use at scale. In this episode Grant Seward explains how he built Tree Schema to be an easy to use and cost effective option for organizations to build their data catalogs. He also shares the internal architecture, how he approached the design to make it accessible and easy to use, and how it autodiscovers the schemas and metadata for your source systems.
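\nSchema autodiscovery of this kind typically starts with a walk of the database's own catalog tables. A minimal sketch against information_schema (the connection details are placeholders):

```python
# Pull every table and column in a schema from the database's own catalog,
# which is the raw material for a data catalog's autodiscovery step.
import psycopg2

conn = psycopg2.connect("dbname=app user=catalog_reader")
cur = conn.cursor()
cur.execute("""
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position
""")
for table, column, dtype in cur.fetchall():
    print(f"{table}.{column}: {dtype}")
```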
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits there are some additional complexities to consider, including how to safely integrate new data sources or test out changes to existing pipelines. In order to address these challenges the team at Treeverse created LakeFS to introduce version control capabilities to your storage layer. In this episode Einat Orr and Oz Katz explain how they implemented branching and merging capabilities for object storage, best practices for how to use versioning primitives to introduce changes to your data lake, how LakeFS is architected, and how you can start using it for your own data platform.
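\nA helpful way to picture the branching model is LakeFS's S3-compatible gateway, where the bucket maps to a repository and the object key is prefixed with a branch name. The following sketch assumes a local LakeFS deployment exposing its gateway on port 8000, with example repository, branch, and credential values; treat it as illustrative rather than a verified configuration.

```python
# Sketch: reading and writing branch-scoped objects through the lakeFS
# S3 gateway. Endpoint, credentials, repo, and branch names are assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",  # assumed lakeFS gateway address
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

# Write to an experimental branch without touching main
s3.put_object(
    Bucket="example-repo",                    # repository
    Key="experiment/events/2020-06-01.json",  # branch/path
    Body=b'{"clicks": 42}',
)

# Readers on main are isolated from the change until a merge happens
main_obj = s3.get_object(Bucket="example-repo", Key="main/events/2020-06-01.json")
```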
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nOne of the most challenging aspects of building a data platform has nothing to do with pipelines and transformations. If you are putting your workflows into production, then you need to consider how you are going to implement data security, including access controls and auditing. Different databases and storage systems all have their own method of restricting access, and they are not all compatible with each other. In order to simplify the process of securing your data in the cloud, Manav Mital created Cyral to provide a way of enforcing security as code. In this episode he explains how the system is architected, how it can help you enforce compliance, and what is involved in getting it integrated with your existing systems. This was a good conversation about an aspect of data management that is too often left as an afterthought.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nIn order for analytics and machine learning projects to be useful, they require a high degree of data quality. To ensure that your pipelines are healthy you need a way to make them observable. In this episode Barr Moses and Lior Gavish, co-founders of Monte Carlo, share the leading causes of what they refer to as data downtime and how it manifests. They also discuss methods for gaining visibility into the flow of data through your infrastructure, how to diagnose and prevent potential problems, and what they are building at Monte Carlo to help you maintain your data’s uptime.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBusiness intelligence efforts are only as useful as the outcomes that they inform. Power BI aims to reduce the time and effort required to go from information to action by providing an interface that encourages rapid iteration. In this episode Rob Collie shares his enthusiasm for the Power BI platform and how it stands out from other options. He explains how he helped to build the platform during his time at Microsoft, and how he continues to support users through his work at Power Pivot Pro. Rob shares some useful insights gained through his consulting work, and why he considers Power BI to be the best option on the market today for business analytics.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAnalytical workloads require a well engineered and well maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy. Meroxa is a new platform that aims to automate the heavy lifting of change data capture, monitoring, and data loading. In this episode founders DeVaris Brown and Ali Hamidi explain how their tenure at Heroku informed their approach to making data integration self service, how the platform is architected, and how they have designed their system to adapt to the continued evolution of the data ecosystem.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nKafka has become a de facto standard interface for building decoupled systems and working with streaming data. Despite its widespread popularity, there are numerous accounts of the difficulty that operators face in keeping it reliable and performant, or trying to scale an installation. To make the benefits of the Kafka ecosystem more accessible and reduce the operational burden, Alexander Gallego and his team at Vectorized created the Red Panda engine. In this episode he explains how they engineered a drop-in replacement for Kafka, replicating the numerous APIs, that can scale more easily and deliver consistently low latencies with a much lower hardware footprint. He also shares some of the areas of innovation that they have found to help foster the next wave of streaming applications while working within the constraints of the existing Kafka interfaces. This was a fascinating conversation with an energetic and enthusiastic engineer and founder about the challenges and opportunities in the realm of streaming data.
\n@vectorizedio company Twitter account
\nConcord alternative to Flink
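\nThe drop-in compatibility described above means that existing Kafka clients should work unmodified, with only the broker address changed. A minimal sketch with the kafka-python client, assuming a compatible broker on the default Kafka port; the topic name is illustrative:

```python
# Sketch: an unmodified Kafka client pointed at a Kafka-compatible broker.
# Broker address and topic name are assumptions.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("pageviews", b'{"url": "/home"}')
producer.flush()

consumer = KafkaConsumer(
    "pageviews",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no more messages arrive
)
for message in consumer:
    print(message.value)
```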
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData engineering is a constantly growing and evolving discipline. There are always new tools, systems, and design patterns to learn, which leads to a great deal of confusion for newcomers. Daniel Molnar has dedicated his time to helping data professionals get back to basics through presentations at conferences and meetups, and with his most recent endeavor of building the Pipeline Data Engineering Academy. In this episode he shares advice on how to cut through the noise, which principles are foundational to building a successful career as a data engineer, and his approach to educating the next generation of data practitioners. This was a useful conversation for anyone working with data who has found themselves spending too much time chasing the latest trends and wishes to develop a more focused approach to their work.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nIn-memory computing provides significant performance benefits, but brings challenges for managing failures and scaling up. Hazelcast is a platform for managing stateful in-memory storage and computation across a distributed cluster of commodity hardware. On top of this foundation, the Hazelcast team has also built a streaming platform for reliable high throughput data transmission. In this episode Dale Kim shares how Hazelcast is implemented, the use cases that it enables, and how it complements on-disk data management systems.
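\nA sense of the programming model comes through in the Python client, where distributed data structures look like local collections. A minimal sketch, assuming a cluster reachable at the client library's default discovery settings:

```python
# Sketch: a distributed map that lives in cluster memory rather than on disk.
# Cluster discovery relies on the client's defaults (an assumption here).
import hazelcast

client = hazelcast.HazelcastClient()
ratings = client.get_map("episode-ratings").blocking()

ratings.put("hazelcast-episode", 4.8)  # stored across the cluster
print(ratings.get("hazelcast-episode"))

client.shutdown()
```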
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDatabases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across multiple sources and storage locations. This frequently requires cumbersome and time-consuming data integration. To address this problem Martin Traverso and his colleagues at Facebook built the Presto distributed query engine. In this episode he explains how it is designed to allow for querying and combining data where it resides, the use cases that such an architecture unlocks, and the innovative ways that it is being employed at companies across the world. If you need to work with data in your cloud data lake, your on-premise database, or a collection of flat files, then give this episode a listen and then try out Presto today.
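\nTo make the federated query model concrete, here is a minimal sketch using the presto-python-client package, assuming a coordinator on port 8080 with a configured hive catalog; the host, catalog, schema, and table names are all illustrative:

```python
# Sketch: querying data where it lives through a Presto coordinator.
# Host, catalog, schema, and table names are assumptions.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="hive",      # e.g. a catalog pointing at a data lake
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM web_events WHERE event_date = DATE '2020-01-01'")
print(cur.fetchall())
```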
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData warehouse technology has been around for decades and has gone through several generational shifts in that time. The current trends in data warehousing are oriented around cloud native architectures that take advantage of dynamic scaling and the separation of compute and storage. Firebolt is taking that a step further with a core focus on speed and interactivity. In this episode CEO and founder Eldad Farkash explains how the Firebolt platform is architected for high throughput, their simple and transparent pricing model to encourage widespread use, and the use cases that it unlocks through interactive query speeds.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nIn order to scale the use of data across an organization there are a number of challenges related to discovery, governance, and integration that need to be solved. The key to those solutions is a robust and flexible metadata management system. LinkedIn has gone through several iterations on the most maintainable and scalable approach to metadata, leading them to their current work on DataHub. In this episode Mars Lan and Pardhu Gunnam explain how they designed the platform, how it integrates into their data platforms, and how it is being used to power data discovery and analytics at LinkedIn.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nMost databases are designed to work with textual data, with some special purpose engines that support domain specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational primitive. In this episode the creator and founder of TileDB shares how he first started working on the underlying technology and the benefits of using a single engine for efficiently storing and querying any form of data. He also discusses the shifts in database architectures from vertically integrated monoliths to separately deployed layers, and the approach he is taking with TileDB cloud to embed the authorization into the storage engine, while providing a flexible interface for compute. This was a great conversation about a different approach to database architecture and how that enables a more flexible way to store and interact with data to power better data sharing and new opportunities for blending specialized domains.
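\nTo illustrate the array-first model, here is a small sketch with the tiledb Python package that stores a dense one-dimensional array and reads back a slice. The API surface shown is from earlier releases of tiledb-py, and the array name and shapes are assumptions:

```python
# Sketch: a dense array as the storage primitive; slice reads touch only
# the tiles they need. Array name and dimensions are illustrative.
import numpy as np
import tiledb

dim = tiledb.Dim(name="x", domain=(0, 9), tile=5, dtype=np.int32)
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(dim),
    sparse=False,
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
)
tiledb.DenseArray.create("example_array", schema)

with tiledb.DenseArray("example_array", mode="w") as arr:
    arr[:] = {"value": np.random.rand(10)}

with tiledb.DenseArray("example_array", mode="r") as arr:
    print(arr[2:5]["value"])
```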
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nEvent based data is a rich source of information for analytics, but only if the event structures are consistent. The team at Iteratively are building a platform to manage the end to end flow of collaboration around what events are needed, how to structure the attributes, and how they are captured. In this episode founders Patrick Thompson and Ondrej Hrebicek discuss the problems that they have experienced as a result of inconsistent event schemas, how the Iteratively platform integrates the definition, development, and delivery of event data, and the benefits of elevating the visibility of event data for improving the effectiveness of the resulting analytics. If you are struggling with inconsistent implementations of event data collection, a lack of clarity about which attributes are needed, or uncertainty about how the data is being used, then this is definitely a conversation worth following.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nFinding connections between data and the entities that they represent is a complex problem. Graph data models and the applications built on top of them are perfect for representing relationships and finding emergent structures in your information. In this episode Denise Gosnell and Matthias Broecheler discuss their recent book, the Practitioner’s Guide To Graph Data, including the fundamental principles that you need to know about graph structures, the current state of graph support in database engines, tooling, and query languages, as well as useful tips on potential pitfalls when putting them into production. This was an informative and enlightening conversation with two experts on graph data applications that will help you start on the right track in your own projects.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nA majority of the scalable data processing platforms that we rely on are built as distributed systems. This brings with it a vast number of subtle ways that errors can creep in. Kyle Kingsbury created the Jepsen framework for testing the guarantees of distributed data processing systems and identifying when and why they break. In this episode he shares his approach to testing complex systems, the common challenges that are faced by engineers who build them, and why it is important to understand their limitations. This was a great look at some of the underlying principles that power your mission critical workloads.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nWind energy is an important component of an ecologically friendly power system, but there are a number of variables that can affect the overall efficiency of the turbines. Michael Tegtmeier founded Turbit Systems to help operators of wind farms identify and correct problems that contribute to suboptimal power outputs. In this episode he shares the story of how he got started working with wind energy, the system that he has built to collect data from the individual turbines, and how he is using machine learning to provide valuable insights to produce higher energy outputs. This was a great conversation about using data to improve the way the world works.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy to use open source option. The Meltano project is aiming to provide a solution to that situation. In this episode, project lead Douwe Maan shares the history of how Meltano got started, the motivation for the recent shift in focus, and how it is implemented. The Singer ecosystem has laid the groundwork for a great option to empower teams of all sizes to unlock the value of their data, and Meltano is building the remaining structure to make it a fully featured contender for proprietary systems.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThere are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, including observability, discovery, and governance of your data. That’s what the Lenses.io DataOps platform is built for. In this episode CTO Andrew Stevenson discusses the challenges that arise from building decoupled systems, the benefits of using SQL as the common interface for your data, and the metrics that need to be tracked to keep the overall system healthy. Observability and governance of streaming data requires a different approach than batch oriented workflows, and this episode does an excellent job of outlining the complexities involved and how to address them.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nWe have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad devices with their sound recognition technology. In this episode Dr. Chris Mitchell and Dr. Thomas le Cornu describe the challenges that they face in the collection and labelling of high quality data to make this possible, including the lack of a publicly available collection of audio samples to work from, the need for custom metadata throughout the processing pipeline, and the need for customized data processing tools for working with sound data. This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high quality data from collection to analysis.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe majority of analytics platforms are focused on internal use by business stakeholders within an organization. As the availability of data increases and overall literacy in how to interpret it and take action improves, there is a growing need to bring business intelligence use cases to a broader audience. GoodData is a platform focused on simplifying the work of bringing data to employees and end users. In this episode Sheila Jung and Philip Farr discuss how the GoodData platform is being used, how it is architected to provide scalable and performant analytics, and how it integrates into customers' data platforms. This was an interesting conversation about a different approach to business intelligence and the importance of expanded access to data.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nMachine learning is a process driven by iteration and experimentation which requires fast and easy access to relevant features of the data being processed. In order to reduce friction in the process of developing and delivering models there has been a recent trend toward building a dedicated feature store. In this episode Simba Khadder discusses his work at StreamSQL building a feature store to make creation, discovery, and monitoring of features fast and easy to manage. He describes the architecture of the system, the benefits of streaming data for machine learning, and how a feature store provides a useful interface between data engineers and machine learning engineers to reduce communication overhead.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe landscape of data management and processing is rapidly changing and evolving. There are certain foundational elements that have remained steady, but as the industry matures new trends emerge and gain prominence. In this episode Astasia Myers of Redpoint Ventures shares her perspective as an investor on which categories she is paying particular attention to for the near to medium term. She discusses the work being done to address challenges in the areas of data quality, observability, discovery, and streaming. This is a useful conversation to gain a macro perspective on where businesses are looking to improve their capabilities to work with data.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert. In order to bring the DBA into the new era of data management the team at Upsolver added a SQL interface to their data lake platform. In this episode Upsolver CEO Ori Rafael and CTO Yoni Iny describe how they have grown their platform deliberately to allow for layering SQL on top of a robust foundation for creating and operating a data lake, how to bring more people on board to work with the data being collected, and the unique benefits that a data lake provides. This was an interesting look at the impact that the interface to your data can have on who is empowered to work with it.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nGaining a complete view of the customer journey is especially difficult in B2B companies. This is due to the number of different individuals involved and the myriad ways that they interface with the business. Dreamdata integrates data from the multitude of platforms that are used by these organizations so that they can get a comprehensive view of their customer lifecycle. In this episode Ole Dallerup explains how Dreamdata was started, how their platform is architected, and the challenges inherent to data management in the B2B space. This conversation is a useful look into how data engineering and analytics can have a direct impact on the success of the business.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe PostgreSQL database is massively popular due to its flexibility and extensive ecosystem of extensions, but it is still not the first choice for high performance analytics. Swarm64 aims to change that by adding support for advanced hardware capabilities like FPGAs and optimized usage of modern SSDs. In this episode CEO and co-founder Thomas Richter discusses his motivation for creating an extension to optimize Postgres hardware usage, the benefits of running your analytics on the same platform as your application, and how it works under the hood. If you are trying to get more performance out of your database then this episode is for you!
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThere have been several generations of platforms for managing streaming data, each with their own strengths and weaknesses, and different areas of focus. Pulsar is one of the recent entrants which has quickly gained adoption and an impressive set of capabilities. In this episode Sijie Guo discusses his motivations for spending so much of his time and energy on contributing to the project and growing the community. His most recent endeavor at StreamNative is focused on combining the capabilities of Pulsar with the cloud native movement to make it easier to build and scale real time messaging systems with built in event processing capabilities. This was a great conversation about the strengths of the Pulsar project, how it has evolved in recent years, and some of the innovative ways that it is being used. Pulsar is a well engineered and robust platform for building the core of any system that relies on durable access to easily scalable streams of data.
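\nFor a flavor of the developer experience, here is a minimal produce/consume sketch with the pulsar-client Python library, assuming a standalone broker on the default port; the topic and subscription names are illustrative:

```python
# Sketch: durable pub/sub on a Pulsar topic. Broker URL, topic, and
# subscription names are assumptions.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

producer = client.create_producer("persistent://public/default/events")
producer.send(b'{"action": "play"}')

consumer = client.subscribe(
    "persistent://public/default/events",
    subscription_name="analytics",
)
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)  # acknowledged messages are not redelivered

client.close()
```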
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData management is hard at any scale, but working in the context of an enterprise organization adds even greater complexity. Infoworks is a platform built to provide a unified set of tooling for managing the full lifecycle of data in large businesses. By reducing the barrier to entry with a graphical interface for defining data transformations and analysis, it makes it easier to bring the domain experts into the process. In this interview co-founder and CTO of Infoworks Amar Arsikere explains the unique challenges faced by enterprise organizations, how the platform is architected to provide the needed flexibility and scale, and how a unified platform for data improves the outcomes of the organizations using it.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData is a critical element to every role in an organization, which is also what makes managing it so challenging. With so many different opinions about which pieces of information are most important, how it needs to be accessed, and what to do with it, many data projects are doomed to failure. In this episode Chris Bergh explains how taking an agile approach to delivering value can drive down the complexity that grows out of the varied needs of the business. Building a DataOps workflow that incorporates fast delivery of well defined projects, continuous testing, and open lines of communication is a proven path to success.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nModern applications frequently require access to real-time data, but building and maintaining the systems that make that possible is a complex and time consuming endeavor. Eventador is a managed platform designed to let you focus on using the data that you collect, without worrying about how to make it reliable. In this episode Eventador Founder and CEO Kenny Gorman describes how the platform is architected, the challenges inherent to managing reliable streams of data, the simplicity offered by a SQL interface, and the interesting projects that his customers have built on top of it. This was an interesting inside look at building a business on top of open source stream processing frameworks and how to reduce the burden on end users.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe software applications that we build for our businesses are a rich source of data, but accessing and extracting that data is often a slow and error-prone process. Rookout has built a platform to separate the data collection process from the lifecycle of your code. In this episode, CTO Liran Haimovitch discusses the benefits of shortening the iteration cycle and bringing non-engineers into the process of identifying useful data. This was a great conversation about the importance of democratizing the work of data collection.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nKnowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the complex connections between multiple sources of information. In this episode John Maiden talks about how Cherre builds knowledge graphs that provide powerful insights for their customers and the engineering challenges of building a scalable graph. If you’re wondering how to extract additional business value from existing data, this episode will provide a way to expand your data resources.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding and maintaining a system that integrates and analyzes all of the data for your organization is a complex endeavor. Operating on a shoe-string budget makes it even more challenging. In this episode Tyler Colby shares his experiences working as a data professional in the non-profit sector. From managing Salesforce data models to wrangling a multitude of data sources and compliance requirements, he describes the biggest challenges that he is facing.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThere are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes of the companies that run these systems at scale so you don’t have to? In this episode Will Smith shares the journey that he and his team at Linode recently completed to bring a fast and reliable S3-compatible object storage service to production for your benefit. He discusses the challenges of running object storage for public usage, some of the interesting ways that it was stress tested internally, and the lessons that he learned along the way.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nCouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and an HTTP interface, it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt which is being addressed with a refactored architecture based on FoundationDB. In this episode Adam Kocoloski shares the history of the project, how it works under the hood, and how the new design will improve the project for our new era of computation. This was an interesting conversation about the challenges of maintaining a large and mission critical project and the work being done to evolve it.
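\nThat HTTP interface means a database can be created and documents written with nothing more than an HTTP client. A minimal sketch, assuming a local CouchDB with example credentials and document contents:

```python
# Sketch: CouchDB's HTTP API. Host, credentials, and document contents
# are assumptions.
import requests

base = "http://admin:secret@localhost:5984"

requests.put(f"{base}/podcast")  # create a database
requests.put(                    # create (or update) a document by id
    f"{base}/podcast/episode-1",
    json={"title": "CouchDB", "guest": "Adam Kocoloski"},
)

doc = requests.get(f"{base}/podcast/episode-1").json()
print(doc["title"], doc["_rev"])  # every revision carries a _rev token
```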
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData governance is a complex endeavor, but scaling it to meet the needs of a complex or globally distributed organization requires a well considered and coherent strategy. In this episode Tim Ward describes an architecture that he has used successfully with multiple organizations to scale compliance. By treating it as a graph problem, where each hub in the network has localized control and inherits higher level controls, this approach reduces overhead and provides greater flexibility. Tim provides useful examples for understanding how to adopt this approach in your own organization, including some technology recommendations for making it maintainable and scalable. If you are struggling to scale data quality controls and governance requirements then this interview will provide some useful ideas to incorporate into your roadmap.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems that were engineered in isolation. The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. In this episode Michael Drogalis, product manager for ksqlDB at Confluent, explains how the system is implemented, how you can use it for building your own stream processing applications, and how it fits into the lifecycle of your data infrastructure. If you have been struggling with building services on low level streaming interfaces then give this episode a listen and try it out for yourself.
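\nTo show what "SQL constructs on top of Kafka" looks like in practice, here is a sketch that submits a stream definition and a derived aggregate table through ksqlDB's REST endpoint. The server address, topic, and schema are assumptions:

```python
# Sketch: defining a stream over a Kafka topic and deriving a continuously
# maintained table from it via the ksqlDB REST API.
import requests

KSQL = "http://localhost:8088/ksql"  # assumed ksqlDB server address

statements = """
CREATE STREAM pageviews (user_id VARCHAR, url VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

CREATE TABLE views_per_user AS
  SELECT user_id, COUNT(*) AS views
  FROM pageviews
  GROUP BY user_id;
"""

resp = requests.post(KSQL, json={"ksql": statements, "streamsProperties": {}})
print(resp.status_code, resp.json())
```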
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nMisaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed services makes this a viable option, but can lead to downstream challenges. In this episode Sean Knapp and Charlie Crocker share their experiences of working in and with companies that have dealt with shadow IT projects and the importance of enabling and empowering the use and exploration of data and analytics. If you have ever been frustrated by seemingly draconian policies or struggled to align everyone on your supported platform, then this episode will help you gain some perspective and set you on a path to productive collaboration.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nOne of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure then it’s worth listening to how Josh and his team are approaching the problem.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDesigning the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with flexibility and speed. Data Vault is an approach that allows for evolving a data model in place without requiring destructive transformations and massive up front design to answer valuable questions. In this episode Kent Graziano shares his journey with data vault, explains how it allows for an agile approach to data warehousing, and explains the core principles of how to use it. If you’re struggling with unwieldy dimensional models, slow moving projects, or challenges integrating new data sources then listen in on this conversation and then give data vault a try for yourself.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nEvery business collects data in some fashion, but sometimes the true value of the collected information only comes when it is combined with other data sources. Data trusts are a legal framework for allowing businesses to collaboratively pool their data. This allows the members of the trust to increase the value of their individual repositories and gain new insights which would otherwise require substantial effort in duplicating the data owned by their peers. In this episode Tom Plagge and Greg Mundy explain how the BrightHive platform serves to establish and maintain data trusts, the technical and organizational challenges they face, and the outcomes that they have witnessed. If you are curious about data sharing strategies or data collaboratives, then listen now to learn more!
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData pipelines are complicated and business critical pieces of technical infrastructure. Unfortunately they are also complex and difficult to test, leading to a significant amount of technical debt which contributes to slower iteration cycles. In this episode James Campbell describes how he helped create the Great Expectations framework to help you gain control and confidence in your data delivery workflows, the challenges of validating and monitoring the quality and accuracy of your data, and how you can use it in your own environments to improve your ability to move fast.
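\nAs a taste of the declarative style, here is a minimal sketch using the legacy pandas-backed Great Expectations API; the exact interface has shifted across versions, and the file and column names are assumptions:

```python
# Sketch: asserting expectations against a CSV file with the legacy
# pandas-backed API. File and column names are illustrative.
import great_expectations as ge

df = ge.read_csv("orders.csv")
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

result = df.validate()
print(result.success)  # False if any expectation failed
```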
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding a reliable data platform is a neverending task. Even if you have a process that works for you and your business there can be unexpected events that require a change in your platform architecture. In this episode the head of data for Mayvenn shares their experience migrating an existing set of streaming workflows onto the Ascend platform after their previous vendor was acquired and changed their offering. This is an interesting discussion about the ongoing maintenance and decision making required to keep your business data up to date and accurate.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe modern era of software development is identified by ubiquitous access to elastic infrastructure for computation and easy automation of deployment. This has led to a class of applications that can quickly scale to serve users worldwide. This requires a new class of data storage which can accommodate that demand without having to rearchitect your system at each level of growth. YugabyteDB is an open source database designed to support planet scale workloads with high data density and full ACID compliance. In this episode Karthik Ranganathan explains how Yugabyte is architected, their motivations for being fully open source, and how they simplify the process of scaling your application from greenfield to global.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDatabases are useful for inspecting the current state of your application, but inspecting the history of that data can get messy without a way to track changes as they happen. Debezium is an open source platform for reliable change data capture that you can use to build supplemental systems for everything from maintaining audit trails to real-time updates of your data warehouse. In this episode Gunnar Morling and Randall Hauch explain why it got started, how it works, and some of the myriad ways that you can use it. If you have ever struggled with implementing your own change data capture pipeline, or understanding when it would be useful then this episode is for you.
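\nDebezium connectors run inside Kafka Connect and are registered over its REST API. A minimal sketch for a Postgres source follows, with the Connect URL, database coordinates, and credentials as placeholder assumptions (note that the topic-prefix property has been renamed across Debezium versions):

```python
# Sketch: registering a Debezium Postgres connector with Kafka Connect.
# Connect URL, database coordinates, and credentials are assumptions.
import requests

connector = {
    "name": "inventory-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "CHANGE_ME",
        "database.dbname": "inventory",
        # Older releases call this database.server.name; newer ones topic.prefix
        "database.server.name": "inventory",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
print(resp.status_code)
```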
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high degree of uptime and reliability. Vadim Semenov works on their data engineering team and joins the podcast in this episode to discuss the challenges that he works through, the systems that DataDog has built to power their business, and how their teams are organized to allow for rapid growth and massive scale. Getting an inside look at the companies behind the services we use is always useful, and this conversation was no exception.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nTransactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.
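\nBecause Materialize speaks the Postgres wire protocol, an incrementally maintained view can be defined from any Postgres client. A minimal sketch with psycopg2, assuming the default Materialize port and an example source named orders:

```python
# Sketch: a materialized view that Materialize keeps up to date as new
# events stream in. Connection details and source names are assumptions.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=6875, user="materialize", dbname="materialize"
)
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE MATERIALIZED VIEW order_totals AS
    SELECT customer_id, sum(amount) AS total
    FROM orders
    GROUP BY customer_id
""")

# Reads are fast because the view is maintained incrementally, not recomputed
cur.execute("SELECT * FROM order_totals ORDER BY total DESC LIMIT 5")
print(cur.fetchall())
```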
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael Watson and Robert Krzyzanowski share their experiences managing and leading the data engineering teams that power the business. They shared helpful insights into some of the challenges associated with working in a regulated industry, organizing teams to deliver value rapidly and reliably, and how they approach career development for data engineers. This was a great conversation for an inside look at how to build and maintain a data driven culture.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe team at Sentry has built a platform for anyone in the world to send software errors and events. As they scaled the volume of customers and data they began running into the limitations of their initial architecture. To address the needs of their business and continue to improve their capabilities they settled on ClickHouse as the new storage and query layer to power their business. In this episode James Cunningham and Ted Kaemming describe the process of rearchitecting a production system, what they learned in the process, and some useful tips for anyone else evaluating ClickHouse.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nWith the constant evolution of technology for data management it can seem impossible to make an informed decision about whether to build a data warehouse, or a data lake, or just leave your data wherever it currently rests. What’s worse is that any time you have to migrate to a new architecture, all of your analytical code has to change too. Thankfully it’s possible to add an abstraction layer to eliminate the churn in your client code, allowing you to evolve your data platform without disrupting your downstream data users. In this episode AtScale co-founder and CTO Matthew Baird describes how the data virtualization and data engineering automation capabilities that are built into the platform free up your engineers to focus on your business needs without having to waste cycles on premature optimization. This was a great conversation about the power of abstractions and appreciating the value of increasing the efficiency of your data team.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe practice of data management is one that requires technical acumen, but there are also many policy and regulatory issues that inform and influence the design of our systems. With the introduction of legal frameworks such as the EU GDPR and California’s CCPA it is necessary to consider how to implement data protection and data privacy principles in the technical and policy controls that govern our data platforms. In this episode Karen Heaton and Mark Sherwood-Edwards share their experience and expertise in helping organizations achieve compliance. Even if you aren’t subject to specific rules regarding data protection it is definitely worth listening to get an overview of what you should be thinking about while building and running data pipelines.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAs data engineers the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data transformations that we write and maintain. Sean Knapp founded Ascend to address the operational challenges of running a production grade and scalable Spark infrastructure, allowing data engineers to focus on the problems that power their business. In this episode he explains the technical implementation of the Ascend platform, the challenges that he has faced in the process, and how you can use it to simplify your dataflow automation. This is a great conversation to get an understanding of all of the incidental engineering that is necessary to make your data reliable.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDespite the fact that businesses have relied on useful and accurate data to succeed for decades now, the state of the art for obtaining and maintaining that information still leaves much to be desired. In an effort to create a better abstraction for building data applications Nick Schrock created Dagster. In this episode he explains his motivation for creating a product for data management, how the programming model simplifies the work of building testable and maintainable pipelines, and his vision for the future of data programming. If you are building dataflows then Dagster is definitely worth exploring.
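\nFor a sense of the programming model, here is a minimal sketch using Dagster's current op/job API; the episode predates this naming (earlier releases used solids and pipelines), and the op names are illustrative:

```python
# Sketch: a two-step Dagster job where the dependency graph is derived
# from how the ops are composed. Names are illustrative.
from dagster import job, op

@op
def extract():
    return [1, 2, 3]

@op
def load(records):
    print(f"loaded {len(records)} records")

@job
def etl():
    load(extract())

if __name__ == "__main__":
    etl.execute_in_process()  # handy for tests and local runs
```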
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe scale and complexity of the systems that we build to satisfy business requirements is increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nManaging a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and table definitions, including unit testing SQL scripts, defining repeatable pipelines, and adding metadata to your warehouse to improve your team’s communication. In this episode CTO and co-founder of Dataform Lewis Hemens joins the show to explain his motivation for creating the platform and company, how it works under the covers, and how you can start using it today to get your data warehouse under control.
\nIntroduction
\nHow did you get involved in the area of data management?
\nCan you start by explaining what Dataform is and the origin story for the platform and company?
\nCan you talk through the workflow for someone using Dataform and highlight the main features that it provides?
\nWhat are some of the challenges and mistakes that are common among engineers and analysts with regard to versioning and evolving schemas and the accompanying data?
\nHow does CI/CD and change management manifest in the context of data warehouse management?
\nHow is the Dataform SDK itself implemented and how has it evolved since you first began working on it?
\nWhat was your selection process for an embedded runtime and how did you decide on JavaScript?
\nWhich database engines do you support and how do you reduce the maintenance burden for supporting different dialects and capabilities?
\nWhat is involved in adding support for a new backend?
\nWhen is Dataform the wrong choice?
\nWhat do you have planned for the future of Dataform?
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe process of exposing your data through a SQL interface has many possible pathways, each with their own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL analytics on semi-structured and structured data. In this episode CEO Venkat Venkataramani and SVP of Product Shruti Bhat explain the origins of Rockset, how it is architected to allow for fast and flexible SQL analytics on your data, and how their serverless platform can save you the time and effort of implementing portions of your own infrastructure.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding an end-to-end data pipeline for your machine learning projects is a complex task, made more difficult by the variety of ways that you can structure it. Kedro is a framework that provides an opinionated workflow that lets you focus on the parts that matter, so that you don’t waste time on gluing the steps together. In this episode Tom Goldenberg explains how it works, how it is being used at QuantumBlack for customer projects, and how it can help you structure your own. Definitely worth a listen to gain more understanding of the benefits that a standardized process can provide.
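\nThe shape of that opinionated workflow is easiest to see in a small sketch: pure functions wrapped in nodes, wired together by named datasets that Kedro resolves through its data catalog. The dataset names here are assumptions:

```python
# Sketch: a Kedro pipeline of pure functions; "raw_events" and the other
# dataset names would be defined in the project's data catalog.
from kedro.pipeline import Pipeline, node

def clean(raw):
    return [record for record in raw if record is not None]

def count(cleaned):
    return {"event_count": len(cleaned)}

pipeline = Pipeline([
    node(clean, inputs="raw_events", outputs="clean_events"),
    node(count, inputs="clean_events", outputs="event_summary"),
])
```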
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nObject storage is quickly becoming the unifying layer for data intensive applications and analytics. Modern, cloud-oriented data warehouses and data lakes both rely on the durability and ease of use that it provides. S3 from Amazon has quickly become the de facto API for interacting with this service, so the team at MinIO has built a production grade, easy to manage storage engine that replicates that interface. In this episode Anand Babu Periasamy shares the origin story for the MinIO platform, the myriad use cases that it supports, and the challenges that they have faced in replicating the functionality of S3. He also explains the technical implementation, innovative design, and broad vision for the project.
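\nSince MinIO replicates the S3 API, standard S3 tooling works against it by pointing at a different endpoint. A minimal sketch with boto3, assuming a local MinIO on its default port with the well-known demo credentials:

```python
# Sketch: using an S3 client against MinIO; only the endpoint and
# credentials differ from talking to AWS. Values shown are assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="raw-data")
s3.put_object(Bucket="raw-data", Key="events/2020-06-01.json", Body=b"{}")
print([o["Key"] for o in s3.list_objects_v2(Bucket="raw-data")["Contents"]])
```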
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch oriented workflows.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of data, increasing the overall complexity and number of moving parts. The engineers at Timber.io decided to build a new tool in the form of Vector that allows for processing both of these data types in a single framework that is reliable and performant. In this episode Ben Johnson and Luke Steensen explain how the project got started, how it compares to other tools in this space, and how you can get involved in making it even better.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical presentations that aren’t burdened by extraneous marketing. To fulfill that need Pete Soderling and his team have been running the Data Council series of conferences and meetups around the world. In this episode Pete discusses his motivation for starting these events, how they serve to bring the data community together, and the observations that he has made about the direction that we are moving. He also shares his experiences as an investor in developer oriented startups and his views on the importance of empowering engineers to launch their own companies.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users has their own set of requirements for the way that they access and interact with those platforms depending on the insights they are trying to gather. Benn Stancil is the chief analyst at Mode Analytics and in this episode he explains the set of considerations and requirements that data analysts need in their tools. He also shares useful patterns for collaboration between data engineers and data analysts, and what they can learn from each other.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nManaging big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise grade analytics it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data sources proliferate it becomes difficult to keep track of everything, particularly for analysts and data scientists who are not involved with the collection and management of that information. Lyft has built the Amundsen platform to address the problem of data discovery and in this episode Tao Feng and Mark Grover explain how it works, why they built it, and how it has impacted the workflow of data professionals in their organization. If you are struggling to realize the value of your information because you don’t know what you have or where it is then give this a listen and then try out Amundsen for yourself.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh. Rather than connecting all of your data flows to one destination, empower your individual business units to create data products that can be consumed by other teams. This was an interesting exploration of a different way to think about the relationship between how your data is produced, how it is used, and how to build a technical platform that supports the organizational needs of your business.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nSuccessful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides a valuable service to businesses and meaningful work to people in developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customer’s existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAnomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service for benchmarking an application that has real-world utility.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages. Jeremiah Lowin has gained experience in both styles of working, leading him to be frustrated with all of the available tools. In this episode he explains his motivation for creating a new workflow engine that marries the needs of data engineers and data scientists, how it helps to smooth the handoffs between teams working on data projects, and how the design lets you focus on what you care about while it handles the failure cases for you. It is exciting to see a new generation of workflow engine that is learning from the benefits and failures of previous tools for processing your data pipelines.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding and maintaining a data lake is a choose-your-own-adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allow for generating significant value, but they can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at Databricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.
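\nAs a rough sketch of the progressive-refinement pattern discussed in the episode (raw "bronze" data cleaned into a "silver" table), here is what that might look like with the open source delta-spark package; the paths, column names, and deduplication step are illustrative assumptions, not a prescribed workflow.

```python
# Minimal sketch of progressive refinement on Delta Lake, assuming
# pyspark and delta-spark are installed (pip install delta-spark).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate()
)

# Land raw events in a "bronze" table as they arrive.
spark.createDataFrame([(1, "click"), (2, "view"), (1, "click")],
                      ["user_id", "event"]) \
    .write.format("delta").mode("append").save("/tmp/bronze/events")

# Refine into a cleaner "silver" table (here, simple deduplication).
bronze = spark.read.format("delta").load("/tmp/bronze/events")
bronze.dropDuplicates(["user_id", "event"]) \
    .write.format("delta").mode("overwrite").save("/tmp/silver/events")
```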
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and maintaining machine learning projects in production is different from regular software projects and the challenges that they bring. He also describes the Hydrosphere platform, and how the different components work together to manage the full machine learning lifecycle of model deployment and retraining. This was a useful conversation to get a better understanding of the unique difficulties that exist for machine learning projects.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components onto a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allow his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service, this is definitely worth listening to for some perspective.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nSome problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for data science that is built to scale. In this episode Joe Doliner, CEO and co-founder, explains how Pachyderm started as an attempt to make data provenance easier to track, how the platform is architected and used today, and examples of how the underlying principles manifest in the workflows of data engineers and data scientists as they collaborate on data projects. In addition to all of that he also shares his thoughts on their recent round of fund-raising and where the future will take them. If you are looking for a set of tools for building your data science workflows then Pachyderm is a solid choice, featuring data versioning, first class tracking of data lineage, and language agnostic data pipelines.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nIn recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool (dbt) is designed to bring battle tested engineering practices to your analytics pipelines. By providing an opinionated set of best practices it simplifies collaboration and boosts confidence in your data teams. In this episode Drew Banin, creator of dbt, explains how it got started, how it is designed, and how you can start using it today to create reliable and well-tested reports in your favorite data warehouse.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something customized to your application? FoundationDB is a distributed key-value store that provides the primitives that you need to build a custom database platform. In this episode Ryan Worl explains how it is architected, how to use it for your applications, and provides examples of system design patterns that can be built on top of it. If you need a foundation for your distributed systems, then FoundationDB is definitely worth a closer look.
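\nFor a flavor of what building on those primitives looks like, here is a minimal sketch using FoundationDB’s Python bindings, assuming a locally running cluster; the key layout and helper functions are illustrative, but the core idea is real: a record and its secondary index are maintained together in one ACID transaction, which is the building block for higher-level database layers.

```python
# Minimal sketch of layering a record store plus secondary index on
# FoundationDB's ordered key-value primitives. Assumes a local cluster
# and the foundationdb Python package; key layout is illustrative.
import fdb

fdb.api_version(620)
db = fdb.open()  # uses the default cluster file

@fdb.transactional
def create_user(tr, user_id, name):
    # Both writes commit atomically, so the index can never drift
    # out of sync with the primary record.
    tr[fdb.tuple.pack(("user", user_id))] = name.encode()
    tr[fdb.tuple.pack(("user_by_name", name, user_id))] = b""

@fdb.transactional
def get_user(tr, user_id):
    value = tr[fdb.tuple.pack(("user", user_id))]
    return value.decode() if value.present() else None

create_user(db, 42, "ada")
print(get_user(db, 42))  # -> "ada"
```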
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nKubernetes is a driving force in the renaissance around deploying and running applications. However, managing the database layer is still a separate concern. The KubeDB project was created as a way of providing a simple mechanism for running your storage system in the same platform as your application. In this episode Tamal Saha explains how the KubeDB project got started, why you might want to run your database with Kubernetes, and how to get started. He also covers some of the challenges of managing stateful services in Kubernetes and how the fast pace of the community has contributed to the evolution of KubeDB. If you are at any stage of a Kubernetes implementation, or just thinking about it, this is definitely worth a listen to get some perspective on how to leverage it for your entire application stack.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nOne of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter’s infrastructure and designed to serve the needs of modern systems. Evan Weaver is the co-founder and CEO of Fauna and in this episode he explains the unique capabilities of Fauna, compares the consensus and transaction algorithm to that used in other NewSQL systems, and describes the ways that it allows for new application design patterns. One of the unique aspects of Fauna that is worth drawing attention to is the first class support for temporality that simplifies querying of historical states of the data. It is definitely worth a good look for anyone building a platform that needs a simple to manage data layer that will scale with your business.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDatabase indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting that equation by providing a flexible, scalable, performant engine for building an index of your data to enable high-speed aggregate analysis. In this episode Seebs explains how Pilosa fits in the broader data landscape, how it is architected, and how you can start using it for your own analysis. This was an interesting exploration of a different way to look at what a database can be.
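\nTo make the idea concrete, here is a toy Python sketch of the bitmap-index model that engines in this space build on; this is conceptual only, not Pilosa’s actual API or storage format. Each (field, value) pair owns a bitmap of record ids, and aggregate questions become fast bitwise intersections.

```python
# Conceptual sketch of a bitmap index: each (field, value) pair owns a
# bitmap of record ids, and aggregates become cheap bitwise operations.
# Toy code for illustration; not Pilosa's API.
from collections import defaultdict

index = defaultdict(int)  # (field, value) -> bitmap packed into an int

def set_bit(field, value, record_id):
    index[(field, value)] |= 1 << record_id

# Index a few records: (country, plan) per record id.
for rid, (country, plan) in enumerate(
        [("us", "pro"), ("de", "pro"), ("us", "free")]):
    set_bit("country", country, rid)
    set_bit("plan", plan, rid)

# "How many pro-plan users are in the US?" is one AND plus a popcount.
matches = index[("country", "us")] & index[("plan", "pro")]
print(bin(matches).count("1"))  # -> 1
```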
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nHow much time do you spend maintaining your data pipeline? How much end user value does that provide? Raghu Murthy founded DataCoral as a way to abstract the low level details of ETL so that you can focus on the actual problem that you are trying to solve. In this episode he explains his motivation for building the DataCoral platform, how it is leveraging serverless computing, the challenges of delivering software as a service to customer environments, and the architecture that he has designed to make batch data management easier to work with. This was a fascinating conversation with someone who has spent his entire career working on simplifying complex data problems.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nAnalytics projects fail all the time, resulting in lost opportunities and wasted resources. There are a number of factors that contribute to that failure and not all of them are under our control. However, many of them are, and as data engineers we can help to keep our projects on the path to success. Eugene Khazin is the CEO of PrimeTSR where he is tasked with rescuing floundering analytics efforts and ensuring that they provide value to the business. In this episode he reflects on the ways that data projects can be structured to provide a higher probability of success and utility, how data engineers can get involved throughout the project lifecycle, and how to salvage a failed project so that some value can be gained from the effort.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nData integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grows. Enterprise organizations feel this acutely due to the silos that occur naturally across business units. The CluedIn team experienced this issue first-hand in their previous roles, leading them to found a business aimed at providing a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the task of integrating with third-party platforms, automating entity extraction and master data management, and the work of providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they manage consistency of the data that they process across different storage backends.
\nIntroduction
\nHow did you get involved in the area of data management?
\nBefore we get started, can you share your definition of what a data fabric is?
\nCan you explain what CluedIn is and share the story of how it started?
\nCan you give an overview of the system architecture that you have built and how it has evolved since you first began building it?
\nFor a new customer of CluedIn, what is involved in the onboarding process?
\nWhat are some of the most challenging aspects of data integration?
\nHow do you manage changes or breakage in the interfaces that you use for source or destination systems?
\nWhat are some of the signals that you monitor to ensure the continued healthy operation of your platform?
\nWhat are some of the most notable customer success stories that you have experienced?
\nWhat are some cases where CluedIn is not the right choice?
\nWhat do you have planned for the future of CluedIn?
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDelivering a data analytics project on time and with accurate information is critical to the success of any business. DataOps is a set of practices to increase the probability of success by creating value early and often, and using feedback loops to keep your project on course. In this episode Chris Bergh, Head Chef of DataKitchen, explains how DataOps differs from DevOps, how the industry has begun adopting DataOps, and how to adopt an agile approach to building your data platform.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nCustomer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are doing and how best to serve them, you may need to send data to multiple services, each with their own tracking code or APIs. To simplify this process and allow your non-engineering employees to gain access to the information they need to do their jobs, Segment provides a single interface for capturing data and routing it to all of the places that you need it. In this interview Segment CTO and co-founder Calvin French-Owen explains how the company got started, how it manages to multiplex data streams from multiple sources to multiple destinations, and how it can simplify your work of gaining visibility into how your customers are engaging with your business.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDeep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and managing the platforms that power these models. To help us understand what is involved, we are joined this week by Thomas Henson. In this episode he shares his experiences experimenting with deep learning, what data engineers need to know about the infrastructure and data requirements to power the models that your team is building, and how it can be used to supercharge our ETL pipelines.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nDistributed storage systems are the foundational layer of any big data stack. There are a variety of implementations which support different specialized use cases and come with associated tradeoffs. Alluxio is a distributed virtual filesystem which integrates with multiple persistent storage systems to provide a scalable, in-memory storage layer for scaling computational workloads independent of the size of your data. In this episode Bin Fan explains how he got involved with the project, how it is implemented, and the use cases that it is particularly well suited for. If your storage and compute layers are too tightly coupled and you want to scale them independently then Alluxio is the tool for the job.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nMachine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and execute on ways that it can be used in large companies. Kevin Dewalt founded Prolego to help Fortune 500 companies build, launch, and maintain their first machine learning projects so that they can remain competitive in our landscape of constant change. In this episode he discusses why machine learning projects require a new set of capabilities, how to build a team from internal and external candidates, and how an example project progressed through each phase of maturity. This was a great conversation for anyone who wants to understand the benefits and tradeoffs of machine learning for their own projects and how to put it into practice.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nArchaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain specific knowledge, and how the information contained in connections that they expose is being used for interesting projects.
\nIntroduction
\nHow did you get involved in the area of data management?
\nI did some database and GIS work for my dissertation in archaeology, back in the late 1990s. I got frustrated at the lack of comparative data, and I got frustrated at all the work I put into creating data that nobody would likely use. So I decided to focus my energies on research data management.
\nCan you start by describing what Open Context is and how it started?
\nOpen Context is an open access data publishing service for archaeology. It started because we needed better ways of disseminating structured data and digital media than is possible with conventional articles, books, and reports.
\nWhat are your protocols for determining which data sets you will work with?
\nDatasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies.
\nWhat are some of the challenges unique to research data?
\nWhat are some of the unique requirements for processing, publishing, and archiving research data?
\nYou have to work on a shoestring budget, essentially providing "public goods". Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices.
\nAnother issue is that it will take a long time to publish enough data to power many "meta-analyses" that draw upon many datasets. The issue is that lots of archaeological data describes very particular places and times. Because datasets can be so particularistic, finding data relevant to your interests can be hard. So, we face a monumental task in supplying enough data to satisfy many, many particularistic interests.
\nHow much education is necessary around your content licensing for researchers who are interested in publishing their data with you?
\nWe require use of Creative Commons licenses, and greatly encourage the CC-BY license or CC-Zero (public domain) to try to keep things simple and easy to understand.
\nCan you describe the system architecture that you use for Open Context?
\nOpen Context is a Django Python application, with a Postgres database and an Apache Solr index. It runs on Google Cloud services on Debian Linux.
\nWhat is the process for cleaning and formatting the data that you host?
\nHow much domain expertise is necessary to ensure proper conversion of the source data?
\nThat’s one of the bottlenecks. We have to do an ETL (extract, transform, load) on each dataset researchers submit for publication. Each dataset may need lots of cleaning and back-and-forth conversations with data creators.
\nCan you discuss the challenges that you face in maintaining a consistent ontology?
\nWhat pieces of metadata do you track for a given data set?
\nCan you speak to the average size of data sets that you manage and any approach that you use to optimize for cost of storage and processing capacity?
\nData archiving is a complicated and difficult endeavor due to issues pertaining to changing data formats and storage media, as well as repeatability of computing environments to generate and/or process them. Can you discuss the technical and procedural approaches that you take to address those challenges?
\nOnce the data is stored you expose it for public use via a set of APIs which support linked data. Can you discuss any complexities that arise from needing to identify and expose interrelations between the data sets?
\nWhat are some of the most interesting uses you have seen of the data that is hosted on Open Context?
\nWhat have been some of the most interesting/useful/challenging lessons that you have learned while working on Open Context?
\nWhat are your goals for the future of Open Context?
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nControlling access to a database is a solved problem… right? It can be straightforward for small teams and a small number of storage engines, but once either or both of those start to scale then things quickly become complex and difficult to manage. After years of running across the same issues in numerous companies and even more projects Justin McCarthy built strongDM to solve database access management for everyone. In this episode he explains how the strongDM proxy works to grant and audit access to storage systems and the benefits that it provides to engineers and team leads.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nBuilding internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper Søgaard and Keld Antonsen share the story of starting and growing the big data group at LEGO. They discuss the challenges of being at global scale from the start, hiring and training talented engineers, prototyping and deploying new systems in the cloud, and what they have learned in the process. This is a useful conversation for engineers, managers, and leadership who are interested in building enterprise big data systems.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nThe past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time oriented events.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Checking In On The Time Series Database Market With TimescaleDB (Interview)","date_published":"2019-01-13T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/499265b5-0538-4257-8b80-6a61729d2708.mp3","mime_type":"audio/mpeg","size_in_bytes":30150129,"duration_in_seconds":2485}]},{"id":"podlove-2019-01-05t03:38:04+00:00-56cffdb23cf7db1","title":"Performing Fast Data Analytics Using Apache Kudu - Episode 64","url":"https://www.dataengineeringpodcast.com/apache-kudu-with-brock-noland-and-jordan-birdsell-episode-64","content_text":"Summary\n\nThe Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill this need the Kudu project was created with a column oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into you analytics pipeline.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Brock Noland and Jordan Birdsell about Apache Kudu and how it is able to provide fast analytics on fast data in the Hadoop ecosystem\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Kudu is and the motivation for building it?\n\nHow does it fit into the Hadoop ecosystem?\nHow does it compare to the work being done on the Iceberg table format?\n\n\n\nWhat are some of the common application and system design patterns that Kudu supports?\nHow is Kudu architected and how has it evolved over the life of the project?\nThere are many projects in and around the Hadoop ecosystem that rely on Zookeeper as a building block for consensus. 
What was the reasoning for using Raft in Kudu?\nHow does the storage layer in Kudu differ from what would be found in systems like Hive or HBase?\n\n\nWhat are the implementation details in the Kudu storage interface that have had the greatest impact on its overall speed and performance?\n\n\n\nA number of the projects built for large scale data processing were not initially built with a focus on operational simplicity. What are the features of Kudu that simplify deployment and management of production infrastructure?\nWhat was the motivation for using C++ as the language target for Kudu?\n\n\nIf you were to start the project over today what would you do differently?\n\n\n\nWhat are some situations where you would advise against using Kudu?\nWhat have you found to be the most interesting/unexpected/challenging lessons learned in the process of building and maintaining Kudu?\nWhat are you most excited about for the future of Kudu?\n\n\nContact Info\n\n\nBrock\n\nLinkedIn\n@brocknoland on Twitter\n\n\n\nJordan\n\n\nLinkedIn\n@jordanbirdsell\njbirdsell on GitHub\n\n\n\nPhData\n\n\nWebsite\nphdata on GitHub\n@phdatainc on Twitter\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nKudu\nPhData\nGetting Started with Apache Kudu\nThomson Reuters\nHadoop\nOracle Exadata\nSlowly Changing Dimensions\nHDFS\nS3\nAzure Blob Storage\nState Farm\nStanly Black & Decker\nETL (Extract, Transform, Load)\nParquet\n\nPodcast Episode\n\n\n\nORC\nHBase\nSpark\n\n\nPodcast Episode\n\n\n\nImpala\nNetflix Iceberg\n\n\nPodcast Episode\n\n\n\nHive ACID\nIOT (Internet Of Things)\nStreamsets\nNiFi\n\n\nPodcast Episode\n\n\n\nKafka Connect\nMoore’s Law\n3D XPoint\nRaft Consensus Algorithm\nSTONITH (Shoot The Other Node In The Head)\nYarn\nCython\n\n\nPodcast.__init__ Episode\n\n\n\nPandas\n\n\nPodcast.__init__ Episode\n\n\n\nCloudera Manager\nApache Sentry\nCollibra\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill this need the Kudu project was created with a column oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into you analytics pipeline.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Bringing Fast Data To The Hadoop Ecosystem With Kudu (Interview)","date_published":"2019-01-06T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/3cc0f209-42b2-4537-bd13-84f17db97869.mp3","mime_type":"audio/mpeg","size_in_bytes":34223144,"duration_in_seconds":3046}]},{"id":"podlove-2018-12-31t13:08:40+00:00-626851842c3eba3","title":"Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63","url":"https://www.dataengineeringpodcast.com/pravega-with-tom-kaitchuck-episode-63","content_text":"Summary\n\nAs more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fullfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different than that of batch oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuk explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the hosts mind is blown by the possibilities of treating everything, including schema information, as a stream.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nTo help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Tom Kaitchuck about Pravega, an open source data storage platform optimized for persistent streams\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Pravega is and the story behind it?\nWhat are the use cases for Pravega and how does it fit into the data ecosystem?\n\nHow does it compare with systems such as Kafka and Pulsar for ingesting and persisting unbounded data?\n\n\n\nHow do you represent a stream on-disk?\n\n\nWhat are the benefits of using this format for persisted streams?\n\n\n\nOne of the compelling aspects of Pravega is the automatic sharding and resource allocation for variations in data patterns. Can you describe how that operates and the benefits that it provides?\nI am also intrigued by the automatic tiering of the persisted storage. 
How does that work and what options exist for managing the lifecycle of the data in the cluster?\nFor someone who wants to build an application on top of Pravega, what interfaces does it provide and what architectural patterns does it lend itself toward?\nWhat are some of the unique system design patterns that are made possible by Pravega?\nHow is Pravega architected internally?\nWhat is involved in integrating engines such as Spark, Flink, or Storm with Pravega?\nA common challenge for streaming systems is exactly once semantics. How does Pravega approach that problem?\n\n\nDoes it have any special capabilities for simplifying processing of out-of-order events?\n\n\n\nFor someone planning a deployment of Pravega, what is involved in building and scaling a cluster?\n\n\nWhat are some of the operational edge cases that users should be aware of?\n\n\n\nWhat are some of the most interesting, useful, or challenging experiences that you have had while building Pravega?\nWhat are some cases where you would recommend against using Pravega?\nWhat is in store for the future of Pravega?\n\n\nContact Info\n\n\ntkaitchuk on GitHub\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nPravega\nAmazon SQS (Simple Queue Service)\nAmazon Simple Workflow Service (SWF)\nAzure\nEMC\nZookeeper\n\nPodcast Episode\n\n\n\nBookkeeper\nKafka\nPulsar\n\n\nPodcast Episode\n\n\n\nRocksDB\nFlink\n\n\nPodcast Episode\n\n\n\nSpark\n\n\nPodcast Episode\n\n\n\nHeron\nLambda Architecture\nKappa Architecture\nErasure Code\nFlink Forward Conference\nCAP Theorem\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fullfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different than that of batch oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuk explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the hosts mind is blown by the possibilities of treating everything, including schema information, as a stream.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Stream-Native Storage For Unbounded Data With Pravega (Interview)","date_published":"2018-12-31T08:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/aa0efe25-0d55-4ab2-9308-356db6ec237c.mp3","mime_type":"audio/mpeg","size_in_bytes":30749877,"duration_in_seconds":2682}]},{"id":"podlove-2018-12-24t02:50:51+00:00-1acaf02e5e97af9","title":"Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62","url":"https://www.dataengineeringpodcast.com/pipelinedb-with-derek-nelson-and-usman-masood-episode-62","content_text":"Summary\n\nProcessing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what PipelineDB is and the motivation for creating it?\n\nWhat are the major use cases that it enables?\nWhat are some example applications that are uniquely well suited to the capabilities of PipelineDB?\n\n\n\nWhat are the major concepts and components that users of PipelineDB should be familiar with?\nGiven the fact that it is a plugin for PostgreSQL, what level of compatibility exists between PipelineDB and other plugins such as Timescale and Citus?\nWhat are some of the common patterns for populating data streams?\nWhat are the options for scaling PipelineDB systems, both vertically and horizontally?\n\n\nHow much elasticity does the system support in terms of changing volumes of inbound data?\nWhat are some of the limitations or edge cases that users should be aware of?\n\n\n\nGiven that inbound data is not persisted to disk, how do you guard against data loss?\n\n\nIs it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location?\nCan a separate table be used as an input stream?\n\n\n\nSince the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views?\nWhat are some of the features that you have found 
to be the most useful which users might initially overlook?\nWhat would be involved in generating an alert or notification on an aggregate output that was in some way anomalous?\nWhat are some of the most challenging aspects of building continuous aggregates on unbounded data?\nWhat have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB?\nWhat are some of the most interesting or unexpected ways that you have seen PipelineDB used?\nWhen is PipelineDB the wrong choice?\nWhat do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone?\n\n\nContact Info\n\n\nDerek\n\nderekjn on GitHub\nLinkedIn\n\n\n\nUsman\n\n\n@usmanm on Twitter\nWebsite\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nPipelineDB\nStride\nPostgreSQL\n\nPodcast Episode\n\n\n\nAdRoll\nProbabilistic Data Structures\nTimescaleDB\n\n\n[Podcast Episode](\n\n\n\nHive\nRedshift\nKafka\nKinesis\nZeroMQ\nNanomsg\nHyperLogLog\nBloom Filter\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Real-Time Analysis Of Time-Series Data In PostgreSQL With PipelineDB (Interview)","date_published":"2018-12-23T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/919c1062-8f5f-45ac-b697-cdf79fb78a16.mp3","mime_type":"audio/mpeg","size_in_bytes":41523140,"duration_in_seconds":3831}]},{"id":"podlove-2018-12-17t03:26:45+00:00-95d1d948f61e999","title":"Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61","url":"https://www.dataengineeringpodcast.com/advice-on-scaling-your-data-pipeline-alongside-your-business-with-christian-heinzmann-episode-61","content_text":"Summary\n\nEvery business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines are used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Christian Heinzmann about how data pipelines evolve as your business grows\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by sharing your definition of a data pipeline?\n\nAt what point in the life of a project or organization should you start thinking about building a pipeline?\n\n\n\nIn the early stages when the scale of the data and business are still small, what are some of the design characteristics that you should be targeting for your pipeline?\n\n\nWhat metrics/use cases should you be optimizing for at this point?\n\n\n\nWhat are some of the indicators that you look for to signal that you are reaching the next order of magnitude in terms of scale?\n\n\nHow do the design requirements for a data pipeline change as you reach this stage?\nWhat are some of the challenges and complexities that begin to present themselves as you build and run your pipeline at medium scale?\n\n\n\nWhat are some of the changes that are necessary as you move to a large scale data pipeline?\nAt each level of scale it is important to minimize the impact of the ETL process on the source systems. 
What are some strategies that you have employed to avoid degrading the performance of the application systems?\nIn recent years there has been a shift to using data lakes as a staging ground before performing transformations. What are your thoughts on that approach?\nWhen performing transformations there is a potential for discarding information or losing fidelity. How have you worked to reduce the impact of this effect?\nTransformations of the source data can be brittle when the format or volume changes. How do you design the pipeline to be resilient to these types of changes?\nWhat are your selection criteria when determining what workflow or ETL engines to use in your pipeline?\n\n\nHow has your preference of build vs buy changed at different scales of operation and as new/different projects become available?\n\n\n\nWhat are some of the dead ends or edge cases that you have had to deal with in your current role at Grubhub?\nWhat are some of the common mistakes or overlooked aspects of building a data pipeline that you have seen?\nWhat are your plans for improving your current pipeline at Grubhub?\nWhat are some references that you recommend for anyone who is designing a new data platform?\n\n\nContact Info\n\n\n@sirchristian on Twitter\nBlog\nsirchristian on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nScaling ETL blog post\nGrubHub\nData Warehouse\nRedshift\nSpark\n\nSpark In Action Podcast Episode\n\n\n\nHive\nAmazon EMR\nLooker\n\n\nPodcast Episode\n\n\n\nRedash\nMetabase\n\n\nPodcast Episode\n\n\n\nA Primer on Enterprise Data Curation\nPub/Sub (Publish-Subscribe Pattern)\nChange Data Capture\nJenkins\nPython\nAzkaban\nLuigi\nZendesk\nData Lineage\nAirBnB Engineering Blog\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines are used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"The Evolution Of ETL As A Function Of Business Growth (Interview)","date_published":"2018-12-16T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/240c56c8-369c-40bc-8459-0421b29d4c35.mp3","mime_type":"audio/mpeg","size_in_bytes":30203381,"duration_in_seconds":2362}]},{"id":"podlove-2018-12-10t03:02:40+00:00-c9d4bf6d08fc6f5","title":"Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60","url":"https://www.dataengineeringpodcast.com/putting-apache-spark-into-action-with-jean-georges-perrin-episode-60","content_text":"Summary\n\nApache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean George Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. 
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Spark is?\n\nWhat are some of the main use cases for Spark?\nWhat are some of the problems that Spark is uniquely suited to address?\nWho uses Spark?\n\n\n\nWhat are the tools offered to Spark users?\nHow does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?\nFor someone building on top of Spark what are the main software design paradigms?\n\n\nHow does the design of an application change as you go from a local development environment to a production cluster?\n\n\n\nOnce your application is written, what is involved in deploying it to a production environment?\nWhat are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?\nWhat are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?\nWhat are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?\nWhat are the limitations of the Spark programming model?\n\n\nWhat are the cases where Spark is the wrong choice?\n\n\n\nWhat was your motivation for writing a book about Spark?\n\n\nWho is the target audience?\n\n\n\nWhat have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?\nWhat advice do you have for anyone who is considering or currently using Spark?\n\n\nContact Info\n\n\n@jgperrin on Twitter\nBlog\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nBook Discount\n\n\nUse the code poddataeng18 to get 40% off of all of Manning’s products at manning.com\n\n\nLinks\n\n\nApache Spark\nSpark In Action\nBook code examples in GitHub\nInformix\nInternational Informix Users Group\nMySQL\nMicrosoft SQL Server\nETL (Extract, Transform, Load)\nSpark SQL and Spark In Action‘s chapter 11\nSpark ML and Spark In Action‘s chapter 18\nSpark Streaming (structured) and Spark In Action‘s chapter 10\nSpark GraphX\nHadoop\nJupyter\n\nPodcast Interview\n\n\n\nZeppelin\nDatabricks\nIBM Watson Studio\nKafka\nFlink\n\n\nPodcast Episode\n\n\n\nAWS Kinesis\nYarn\nHDFS\nHive\nScala\nPySpark\nDAG\nSpark Catalyst\nSpark Tungsten\nSpark UDF\nAWS EMR\nMesos\nDC/OS\nKubernetes\nDataframes\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean George Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. 
In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Tackling Apache Spark From The Data Engineer's Perspective (Interview)","date_published":"2018-12-09T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/23a22282-ba13-48f2-b655-a919de993dbc.mp3","mime_type":"audio/mpeg","size_in_bytes":41056849,"duration_in_seconds":3031}]},{"id":"podlove-2018-12-03t03:03:53+00:00-94eaf4cf8c919f9","title":"Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59","url":"https://www.dataengineeringpodcast.com/apache-zookeeper-with-patrick-hunt-episode-59","content_text":"\nSummary\nDistributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather then re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.\nPreamble\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. 
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Zookeeper is and how the project got started?\n\nWhat are the main motivations for using a centralized coordination service for distributed systems?\n\n\nWhat are the distributed systems primitives that are built into Zookeeper?\n\nWhat are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper?\nWhat are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?\n\n\nCan you discuss how Zookeeper is architected and how that design has evolved over time?\n\nWhat have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?\n\n\nWhat are the scaling factors for Zookeeper?\n\nWhat are the edge cases that users should be aware of?\nWhere does it fall on the axes of the CAP theorem?\n\n\nWhat are the main failure modes for Zookeeper?\n\nHow much of the recovery logic is left up to the end user of the Zookeeper cluster?\n\n\nSince there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?\nIn recent years we have seen projects such as EtcD which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects?\n\nWhat are some of the cases where Zookeeper is the wrong choice?\n\n\nHow have the needs of distributed systems engineers changed since you first began working on Zookeeper?\nIf you were to start the project over today, what would you do differently?\n\nWould you still use Java?\n\n\nWhat are some of the most interesting or unexpected ways that you have seen Zookeeper used?\nWhat do you have planned for the future of Zookeeper?\n\nContact Info\n\n@phunt on Twitter\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nZookeeper\nCloudera\nGoogle Chubby\nSourceforge\nHBase\nHigh Availability\nFallacies of distributed computing\nFalsehoods programmers believe about networking\nConsul\nEtcD\nApache Curator\nRaft Consensus Algorithm\nZookeeper Atomic Broadcast\nSSD Write Cliff\nApache Kafka\nApache Flink\n\nPodcast Episode\n\n\nHDFS\nKubernetes\nNetty\nProtocol Buffers\nAvro\nRust\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather then re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. 
He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\nWhen your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source systems and load it into a data lake or data warehouse. In order to make this situation more manageable and allow everyone in the business to gain value from the data, the folks at Dremio built a self-service data platform. In this episode Tomer Shiran, CEO and co-founder of Dremio, explains how it fits into the modern data landscape, how it works under the hood, and how you can start using it today to make your life easier.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Building The Dremio Open Source Data-as-a-Service Platform (Interview)","date_published":"2018-11-25T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/025a776c-ba80-4854-a29d-d50e1a5df4f8.mp3","mime_type":"audio/mpeg","size_in_bytes":30125935,"duration_in_seconds":2358}]},{"id":"podlove-2018-11-19t00:09:52+00:00-61f515f3979b867","title":"Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57","url":"https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57","content_text":"Summary\n\nModern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the lanscape of stream processing tools, and how you can start using it today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Flink is and how the project got started?\nWhat are some of the primary ways that Flink is used?\nHow does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?\n\nWhat are some use cases that Flink is uniquely qualified to handle?\n\n\n\nWhere does Flink fit into the current data landscape?\nHow is Flink architected?\n\n\nHow has that architecture evolved?\nAre there any aspects of the current design that you would do differently if you started over today?\n\n\n\nHow does scaling work in a Flink deployment?\n\n\nWhat are the scaling limits?\nWhat are some of the failure modes that users should be aware of?\n\n\n\nHow is the statefulness of a cluster managed?\n\n\nWhat are the mechanisms for managing conflicts?\nWhat are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?\nCan state be shared across processes or tasks within a Flink cluster?\n\n\n\nWhat are the comparative challenges of working with bounded vs unbounded streams of data?\nHow do you handle out of order events in Flink, especially as the delay for a given event increases?\nFor 
someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?\nWhat are some of the most challenging or complicated aspects of building and maintaining Flink?\nWhat are some of the most interesting or unexpected ways that you have seen Flink used?\nWhat are some of the improvements or new features that are planned for the future of Flink?\nWhat are some features or use cases that you are explicitly not planning to support?\nFor people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?\n\n\nWhat do they find most interesting or exciting?\n\n\n\n\n\nContact Info\n\n\nLinkedIn\n@fhueske on Twitter\nfhueske on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nFlink\nData Artisans\nIBM\nDB2\nTechnische Universität Berlin\nHadoop\nRelational Database\nGoogle Cloud Dataflow\nSpark\nCascading\nJava\nRocksDB\nFlink Checkpoints\nFlink Savepoints\nKafka\nPulsar\nStorm\nScala\nLINQ (Language INtegrated Query)\nSQL\nBackpressure\nWatermarks\nHDFS\nS3\nAvro\nJSON\nHive Metastore\nDell EMC\nPravega\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Scalable and Stateful Streaming Data With Apache Flink (Interview)","date_published":"2018-11-18T19:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/8529d7fa-4286-4292-ae22-d2b09cce42d5.mp3","mime_type":"audio/mpeg","size_in_bytes":39909768,"duration_in_seconds":2881}]},{"id":"podlove-2018-11-11t20:56:21+00:00-77d9d6217f217c6","title":"How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56","url":"https://www.dataengineeringpodcast.com/upsolver-with-yoni-iny-episode-56","content_text":"Summary\n\nA data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Upsolver is and how it got started?\n\nWhat are your goals for the platform?\n\n\n\nThere are a lot of opinions on both sides of the data lake argument. 
When is it the right choice for a data platform?\n\n\nWhat are the shortcomings of a data lake architecture?\n\n\n\nHow is Upsolver architected?\n\n\nHow has that architecture changed over time?\nHow do you manage schema validation for incoming data?\nWhat would you do differently if you were to start over today?\n\n\n\nWhat are the biggest challenges at each of the major stages of the data lake?\nWhat is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?\nWhen is Upsolver the wrong choice for an organization considering implementation of a data platform?\nIs there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house?\nWhat features or improvements do you have planned for the future of Upsolver?\n\n\nContact Info\n\n\nYoni\n\nyoniiny on GitHub\nLinkedIn\n\n\n\nUpsolver\n\n\nWebsite\n@upsolver on Twitter\nLinkedIn\nFacebook\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nUpsolver\nData Lake\nIsraeli Army\nData Warehouse\nData Engineering Podcast Episode About Data Curation\nThree Vs\nKafka\nSpark\nPresto\nDrill\nSpot Instances\nObject Storage\nCassandra\nRedis\nLatency\nAvro\nParquet\nORC\nData Engineering Podcast Episode About Data Serialization Formats\nSSTables\nRun Length Encoding\nCSV (Comma Separated Values)\nProtocol Buffers\nKinesis\nETL\nDevOps\nPrometheus\nCloudwatch\nDataDog\nInfluxDB\nSQL\nPandas\nConfluent\nKSQL\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Building A Data Lake Platform In The Cloud At Upsolver (Interview)","date_published":"2018-11-11T16:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/74c7daab-0d26-4b70-a18e-f3b19f418ab1.mp3","mime_type":"audio/mpeg","size_in_bytes":29082977,"duration_in_seconds":3110}]},{"id":"podlove-2018-11-05t01:42:46+00:00-a2201e965b3e139","title":"Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55","url":"https://www.dataengineeringpodcast.com/looker-with-daniel-mintz-episode-55","content_text":"Summary\n\nBusiness intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a a modern data platform that can serve the data needs of an entire company\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what Looker is and the problem that it is aiming to solve?\n\nHow do you define business intelligence?\n\n\n\nHow is Looker unique from other approaches to business intelligence in the enterprise?\n\n\nHow does it compare to open source platforms for BI?\n\n\n\nCan you describe the technical infrastructure that supports Looker?\nGiven that you are connecting to the customer’s data store, how do you ensure sufficient security?\nFor someone who is using Looker, what does their workflow look like?\n\n\nHow does that change for different user roles (e.g. 
data engineer vs sales management)\n\n\n\nWhat are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency?\nWhat are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?\n\n\nWhat are the portions of the Looker architecture that you would do differently if you were to start over today?\n\n\n\nWhat are some of the most interesting or unusual uses of Looker that you have seen?\nWhat is in store for the future of Looker?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nLooker\nUpworthy\nMoveOn.org\nLookML\nSQL\nBusiness Intelligence\nData Warehouse\nLinux\nHadoop\nBigQuery\nSnowflake\nRedshift\nDB2\nPostGres\nETL (Extract, Transform, Load)\nELT (Extract, Load, Transform)\nAirflow\nLuigi\nNiFi\nData Curation Episode\nPresto\nHive\nAthena\nDRY (Don’t Repeat Yourself)\nLooker Action Hub\nSalesforce\nMarketo\nTwilio\nNetscape Navigator\nDynamic Pricing\nSurvival Analysis\nDevOps\nBigQuery ML\nSnowflake Data Sharehouse\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Easy And Powerful Self Service Business Intelligence With Looker (Interview)","date_published":"2018-11-04T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4f14f000-9269-46f2-b63d-fb7d95cbba36.mp3","mime_type":"audio/mpeg","size_in_bytes":34471531,"duration_in_seconds":3484}]},{"id":"podlove-2018-10-29t01:11:31+00:00-bc8bc0c289f7e69","title":"Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54","url":"https://www.dataengineeringpodcast.com/using-notebooks-as-the-unifying-layer-for-data-roles-at-netflix-with-matthew-seal-episode-54","content_text":"Summary\n\nJupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. 
Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?\n\nWhere are you using notebooks and where are you not?\n\n\n\nWhat is the technical infrastructure that you have built to support that design choice?\nWhich team was driving the effort?\n\n\nWas it difficult to get buy in across teams?\n\n\n\nHow much shared code have you been able to consolidate or reuse across teams/roles?\nHave you investigated the use of any of the other notebook platforms for similar workflows?\nWhat are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?\nWhat are some of the limitations of the notebook environment for the work that you are doing?\nWhat have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?\nWhat are some of the projects that are ongoing or planned for the future that you are most excited by?\n\n\nContact Info\n\n\nMatthew Seal\n\nEmail\nLinkedIn\n@codeseal on Twitter\nMSeal on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nNetflix Notebook Blog Posts\nNteract Tooling\nOpenGov\nProject Jupyter\nZeppelin Notebooks\nPapermill\nTitus\nCommuter\nScala\nPython\nR\nEmacs\nNBDime\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"How Netflix Is Using Jupyter Notebooks In Production (Interview)","date_published":"2018-10-28T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b1d91aff-fde9-413b-81b4-904e66268255.mp3","mime_type":"audio/mpeg","size_in_bytes":32126228,"duration_in_seconds":2454}]},{"id":"podlove-2018-10-22t01:49:12+00:00-7b6367ae353b3ac","title":"Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53","url":"https://www.dataengineeringpodcast.com/deon-with-emily-miller-and-peter-bull-episode-53","content_text":"Summary\n\nAs data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nThis is your host Tobias Macey and this week I am sharing an episode from my other show, Podcast.__init__, about a project from Driven Data called Deon. It is a simple tool that generates a checklist of ethical considerations for the various stages of the lifecycle for data oriented projects. This is an important topic for all of the teams involved in the management and creation of projects that leverage data. So give it a listen and if you like what you hear, be sure to check out the other episodes at pythonpodcast.com\n\n\nInterview\n\n\nIntroductions\nHow did you get introduced to Python?\nCan you start by describing what Deon is and your motivation for creating it?\nWhy a checklist, specifically? What’s the advantage of this over an oath, for example?\nWhat is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?\nWhat is the typical workflow for a team that is using Deon in their projects?\nDeon ships with a default checklist but allows for customization. 
What are some common addendums that you have seen?\n\nHave you received pushback on any of the default items?\n\n\n\nHow does Deon simplify communication around ethics across team boundaries?\nWhat are some of the most often overlooked items?\nWhat are some of the most difficult ethical concerns to comply with for a typical data science project?\nHow has Deon helped you at Driven Data?\nWhat are the customer facing impacts of embedding a discussion of ethics in the product development process?\nSome of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?\nWhat are your hopes for the future of the Deon project?\n\n\nKeep In Touch\n\n\nEmily\n\nLinkedIn\nejm714 on GitHub\n\n\n\nPeter\n\n\nLinkedIn\n@pjbull on Twitter\npjbull on GitHub\n\n\n\nDriven Data\n\n\n@drivendataorg on Twitter\ndrivendataorg on GitHub\nWebsite\n\n\n\n\n\nPicks\n\n\nTobias\n\nRichard Bond Glass Art\n\n\n\nEmily\n\n\nTandem Coffee in Portland, Maine\n\n\n\nPeter\n\n\nThe Model Bakery in Saint Helena and Napa, California\n\n\n\n\n\nLinks\n\n\nDeon\nDriven Data\nInternational Development\nBrookings Institution\nStata\nEconometrics\nMetis Bootcamp\nPandas\n\nPodcast Episode\n\n\n\nC#\n.NET\nPodcast.__init__ Episode On Software Ethics\nJupyter Notebook\n\n\nPodcast Episode\n\n\n\nWord2Vec\ncookiecutter data science\nLogistic Regression\n\n\nThe intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA","content_html":"As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.
\n\nThe intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
","summary":"Of Checklists, Ethics, and Data (Interview)","date_published":"2018-10-21T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a8e4484b-9ae0-4e70-80f3-2951bb7a72e3.mp3","mime_type":"audio/mpeg","size_in_bytes":32506206,"duration_in_seconds":2732}]},{"id":"podlove-2018-10-14t23:24:07+00:00-ff1cb5a6698d52b","title":"Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52","url":"https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52","content_text":"Summary\n\nWith the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly different which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Iceberg is and the motivation for creating it?\n\nWas the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?\n\n\n\nHow has the use of Iceberg simplified your work at Netflix?\nHow is the reference implementation architected and how has it evolved since you first began work on it?\n\n\nWhat is involved in deploying it to a user’s environment?\n\n\n\nFor someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?\n\n\nIs there a migration path for pre-existing tables into the Iceberg format?\n\n\n\nHow is schema evolution managed at the file level?\n\n\nHow do you handle files on disk that don’t contain all of the fields specified in a table definition?\n\n\n\nOne of the complicated problems in data modeling is managing table partitions. 
How does Iceberg help in that regard?\nWhat are the unique challenges posed by using S3 as the basis for a data lake?\n\n\nWhat are the benefits that outweigh the difficulties?\n\n\n\nWhat have been some of the most challenging or contentious details of the specification to define?\n\n\nWhat are some things that you have explicitly left out of the specification?\n\n\n\nWhat are your long-term goals for the Iceberg specification?\n\n\nDo you anticipate the reference implementation continuing to be used and maintained?\n\n\n\n\n\nContact Info\n\n\nrdblue on GitHub\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nIceberg Reference Implementation\nIceberg Table Specification\nNetflix\nHadoop\nCloudera\nAvro\nParquet\nSpark\nS3\nHDFS\nHive\nORC\nS3mper\nGit\nMetacat\nPresto\nPig\nDDL (Data Definition Language)\nCost-Based Optimization\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Iceberg: Improving The Utility Of Cloud-Native Big Data At Netflix (Interview)","date_published":"2018-10-14T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/72d14cd1-c100-48a6-a787-0f9826c34b5c.mp3","mime_type":"audio/mpeg","size_in_bytes":40284365,"duration_in_seconds":3225}]},{"id":"podlove-2018-10-09t12:06:14+00:00-f4ac3d0a394116c","title":"Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51","url":"https://www.dataengineeringpodcast.com/memsql-with-nikita-shamgunov-episode-51","content_text":"Summary\n\nOne of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nYou work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.\nAnd the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. 
Go to metismachine.com/webinars now to register.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a newSQL database built for simultaneous transactional and analytic workloads\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by describing what MemSQL is and how the product and business first got started?\nWhat are the typical use cases for customers running MemSQL?\nWhat are the benefits of integrating the ingestion pipeline with the database engine?\n\nWhat are some typical ways that the ingest capability is leveraged by customers?\n\n\n\nHow is MemSQL architected and how has the internal design evolved from when you first started working on it?\n\n\nWhere does it fall on the axes of the CAP theorem?\n\n\nHow much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory?\n\nCan you describe the lifecycle of a write transaction?\n\n\n\n\n\nCan you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance?\n\n\nHow do you mitigate the impact of network latency throughout the cluster during query planning and execution?\n\n\n\nHow much of the implementation of MemSQL is using custom built code vs. open source projects?\n\nWhat are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL?\nWhat have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL?\nWhen is MemSQL the wrong choice for a data platform?\nWhat do you have planned for the future of MemSQL?\n\n\nContact Info\n\n\n@nikitashamgunov on Twitter\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nMemSQL\nNewSQL\nMicrosoft SQL Server\nSt. Petersburg University of Fine Mechanics And Optics\nC\nC++\nIn-Memory Database\nRAM (Random Access Memory)\nFlash Storage\nOracle DB\nPostgreSQL\n\nPodcast Episode\n\n\n\nKafka\nKinesis\nWealth Management\nData Warehouse\nODBC\nS3\nHDFS\nAvro\nParquet\nData Serialization Podcast Episode\nBroadcast Join\nShuffle Join\nCAP Theorem\nApache Arrow\nLZ4\nS2 Geospatial Library\nSybase\nSAP Hana\nKubernetes\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Fast, Scalable, and Flexible Data For Applications And Analytics On MemSQL (Interview)","date_published":"2018-10-09T08:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d1b62f72-a74f-4d32-9274-0aa2f3f84e67.mp3","mime_type":"audio/mpeg","size_in_bytes":44516583,"duration_in_seconds":3414}]},{"id":"podlove-2018-09-30t23:41:18+00:00-e77db96e522ee0c","title":"Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50","url":"https://www.dataengineeringpodcast.com/enigma-with-chris-groskopf-episode-50","content_text":"Summary\n\nThere are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nYou work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. 
Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you give a brief overview of what Enigma has built and what the motivation was for starting the company?\n\nHow do you define the concept of a knowledge graph?\n\n\n\nWhat are the processes involved in constructing a knowledge graph?\nCan you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph?\nWhat are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?\n\n\nHow do you manage the software lifecycle for your ETL code?\nWhat kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?\n\n\n\nWhat are the current challenges that you are facing in building and scaling your data infrastructure?\n\n\nHow does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose?\nWhat techniques are you using to manage accuracy and consistency in the data that you ingest?\n\n\n\nCan you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers?\nWhat are the weak spots in your platform that you are planning to address in upcoming projects?\n\n\nIf you were to start from scratch today, what would you have done differently?\n\n\n\nWhat are some of the most interesting or unexpected uses of your product that you have seen?\nWhat is in store for the future of Enigma?\n\n\nContact Info\n\n\nEmail\nTwitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nEnigma\nChicago Tribune\nNPR\nQuartz\nCSVKit\nAgate\nKnowledge Graph\nTaxonomy\nConcourse\nAirflow\nDocker\nS3\nData Lake\nParquet\n\nPodcast Episode\n\n\n\nSpark\nAWS Neptune\nAWS Batch\nMoney Laundering\nJupyter Notebook\nPapermill\nJupytext\nCauldron: The Un-Notebook\n\n\nPodcast.__init__ Episode\n\n\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"The Data Engineering Behind A Real-World Knowledge Graph (Interview)","date_published":"2018-09-30T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ed5c6ea6-36db-4792-8e31-a95276b6ae01.mp3","mime_type":"audio/mpeg","size_in_bytes":36478263,"duration_in_seconds":3172}]},{"id":"podlove-2018-09-24t02:17:29+00:00-090c9d962100c0f","title":"A Primer On Enterprise Data Curation with Todd Walter - Episode 49","url":"https://www.dataengineeringpodcast.com/a-primer-on-enterprise-data-curation-with-todd-walter-episode-49","content_text":"Summary\nAs your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.\nPreamble\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nYou work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. 
Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nHow do you define data curation?\n\nWhat are some of the high level concerns that are encapsulated in that effort?\n\n\nHow does the size and maturity of a company affect the ways that they architect and interact with their data systems?\nCan you walk through the stages of an ideal lifecycle for data within the context of an organization’s uses for it?\nWhat are some of the common mistakes that are made when designing a data architecture and how do they lead to failure?\nWhat has changed in terms of complexity and scope for data architecture and curation since you first started working in this space?\nAs “big data” became more widely discussed, the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry is reaching a greater degree of maturity and more regulations are implemented, there has been a shift to being more considerate as to what information gets stored and for how long. What are your views on that evolution and what is your litmus test for determining which data to keep?\nIn terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?\n\nWhat is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?\n\n\nOnce an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure?\nETL has long been the default approach for building and enforcing data architecture, but there have been significant shifts in recent years due to the emergence of streaming systems and ELT approaches in new data warehouses. What are your thoughts on the landscape for managing data flows and migration and when to use which approach?\nWhat are some of the areas of data architecture and curation that are most often forgotten or ignored?\nWhat resources do you recommend for anyone who is interested in learning more about the landscape of data architecture and curation?\n\nContact Info\n\nLinkedIn\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nTeradata\nData Architecture\nData Curation\nData Warehouse\nChief Data Officer\nETL (Extract, Transform, Load)\nData Lake\nMetadata\nData Lineage\n\nData Provenance\n\n\nStrata Conference\nELT (Extract, Load, Transform)\nMap-Reduce\nHive\nPig\nSpark\nData Governance\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n","content_html":"As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. 
Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Taking Ownership Of Your Web Analytics With Snowplow (Interview)","date_published":"2018-09-16T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/18448c4b-3afb-4f2a-a84d-03bcfaf9546b.mp3","mime_type":"audio/mpeg","size_in_bytes":31591472,"duration_in_seconds":2868}]},{"id":"podlove-2018-09-10t01:18:37+00:00-ff982db60725263","title":"Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47","url":"https://www.dataengineeringpodcast.com/chaos-search-with-pete-cheslock-and-thomas-hazel-episode-47","content_text":"Summary\n\nElasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for our serverless data analysis.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $/0 credit and launch a new server in under a minute.\nYou work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. 
Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Pete Cheslock and Thomas Hazel about Chaos Search and their effort to bring historical depth to your Elasticsearch data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what you have built at Chaos Search and the problems that you are trying to solve with it?\n\nWhat types of data are you focused on supporting?\nWhat are the challenges inherent to scaling an elasticsearch infrastructure to large volumes of log or metric data?\n\n\n\nIs there any need for an Elasticsearch cluster in addition to Chaos Search?\nFor someone who is using Chaos Search, what mechanisms/formats would they use for loading their data into S3?\nWhat are the benefits of implementing the Elasticsearch API on top of your data in S3 as opposed to using systems such as Presto or Drill to interact with the same information via SQL?\nGiven that the S3 API has become a de facto standard for many other object storage platforms, what would be involved in running Chaos Search on data stored outside of AWS?\nWhat mechanisms do you use to allow for such drastic space savings of indexed data in S3 versus in an Elasticsearch cluster?\nWhat is the system architecture that you have built to allow for querying terabytes of data in S3?\n\n\nWhat are the biggest contributors to query latency and what have you done to mitigate them?\n\n\n\nWhat are the options for access control when running queries against the data stored in S3?\nWhat are some of the most interesting or unexpected uses of Chaos Search and access to large amounts of historical log information that you have seen?\nWhat are your plans for the future of Chaos Search?\n\n\nContact Info\n\n\nPete Cheslock\n\n@petecheslock on Twitter\nWebsite\n\n\n\nThomas Hazel\n\n\n@thomashazel on Twitter\nLinkedIn\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nChaos Search\nAWS S3\nCassandra\nElasticsearch\n\nPodcast Interview\n\n\n\nPostgreSQL\nDistributed Systems\nInformation Theory\nLucene\nInverted Index\nKibana\nLogstash\nNVMe\nAWS KMS\nKinesis\nFluentD\nParquet\nAthena\nPresto\nDrill\nBackblaze\nOpenStack Swift\nMinio\nEMR\nDataDog\nNewRelic\nElastic Beats\nMetricbeat\nGraphite\nSnappy\nScala\nAkka\nElastalert\nTensorflow\nX-Pack\nData Lake\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. 
They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for your serverless data analysis.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Using Chaos Search To Make Long Term Log Storage Affordable And Useful (Interview)","date_published":"2018-09-09T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/fe0b97ef-ca51-4449-9380-bd69d4e10feb.mp3","mime_type":"audio/mpeg","size_in_bytes":30805138,"duration_in_seconds":2888}]},{"id":"podlove-2018-09-03t16:44:39+00:00-47c593fa2092739","title":"An Agile Approach To Master Data Management with Mark Marinelli - Episode 46","url":"https://www.dataengineeringpodcast.com/an-agile-approach-to-master-data-management-with-mark-marinelli-episode-46","content_text":"Summary\n\nWith the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nYou work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. 
Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by establishing a definition of data mastering that we can work from?\n\nHow does the master data set get used within the overall analytical and processing systems of an organization?\n\n\n\nWhat is the traditional workflow for creating a master data set?\n\n\nWhat has changed in the current landscape of businesses and technology platforms that makes that approach impractical?\nWhat are the steps that an organization can take to evolve toward an agile approach to data mastering?\n\n\n\nAt what scale of company or project does it make sense to start building a master data set?\nWhat are the limitations of using ML/AI to merge data sets?\nWhat are the limitations of a golden master data set in practice?\n\n\nAre there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them?\nAre there specific problem domains that are more likely to benefit from a master data set?\n\n\n\nOnce a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data) \nWhat storage mechanisms are typically used for managing a master data set?\n\n\nAre there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that go beyond the rest of their data infrastructure?\nHow do you manage latency issues when trying to reference the same entities from multiple disparate systems?\n\n\n\nWhat have you found to be the most common stumbling blocks for a group that is implementing a master data platform?\n\n\nWhat suggestions do you have to help prevent such a project from being derailed?\n\n\n\nWhat resources do you recommend for someone looking to learn more about the theoretical and practical aspects of data mastering for their organization?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nTamr\nMulti-Dimensional Database\nMaster Data Management\nETL\nEDW (Enterprise Data Warehouse)\nWaterfall Development Method\nAgile Development Method\nDataOps\nFeature Engineering\nTableau\nQlik\nData Catalog\nPowerBI\nRDBMS (Relational Database Management System)\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. 
In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Building A Master Data Catalog Using Machine Learning (Interview)","date_published":"2018-09-03T14:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/4d29958c-7881-42fa-8c71-2b1e5dccb6f9.mp3","mime_type":"audio/mpeg","size_in_bytes":34971487,"duration_in_seconds":2836}]},{"id":"podlove-2018-08-27t00:16:49+00:00-a46e7e01c804851","title":"Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45","url":"https://www.dataengineeringpodcast.com/enveil-with-ellison-anne-williams-episode-45","content_text":"Summary\n\nThere are myriad reasons why data should be protected, and just as many ways to enforce it in tranist or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anny Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Ellison Anne Williams about Enveil, a pioneering data security company protecting Data in Use\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data security?\nCan you start by explaining what your mission is with Enveil and how the company got started?\nOne of the core aspects of your platform is the principal of homomorphic encryption. Can you explain what that is and how you are using it?\n\nWhat are some of the challenges associated with scaling homomorphic encryption?\nWhat are some difficulties associated with working on encrypted data sets?\n\n\n\nCan you describe the underlying architecture for your data platform?\n\n\nHow has that architecture evolved from when you first began building it?\n\n\n\nWhat are some use cases that are unlocked by having a fully encrypted data platform?\nFor someone using the Enveil platform, what does their workflow look like?\nA major reason for never decrypting data is to protect it from attackers and unauthorized access. What are some of the remaining attack vectors?\nWhat are some aspects of the data being protected that still require additional consideration to prevent leaking information? (e.g. 
identifying individuals based on geographic data, or purchase patterns)\nWhat do you have planned for the future of Enveil?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data security today?\n\n\nLinks\n\n\nEnveil\nNSA\nGDPR\nIntellectual Property\nZero Trust\nHomomorphic Encryption\nCiphertext\nHadoop\nPII (Personally Identifiable Information)\nTLS (Transport Layer Security)\nSpark\nElasticsearch\nSide-channel attacks\nSpectre and Meltdown\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"There are myriad reasons why data should be protected, and just as many ways to enforce it in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Using Homomorphic Encryption In Production With Enveil (Interview)","date_published":"2018-08-27T15:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f1783c50-fe4b-4eea-957e-a6efd4645836.mp3","mime_type":"audio/mpeg","size_in_bytes":26442491,"duration_in_seconds":1481}]},{"id":"podlove-2018-08-20t03:15:13+00:00-95881c7e7bccea6","title":"Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44","url":"https://www.dataengineeringpodcast.com/dgraph-with-manish-jain-episode-44","content_text":"Summary\n\nThe way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph, however databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nIf you have ever wished that you could use the same tools for versioning and distributing your data that you use for your software then you owe it to yourself to check out what the fine folks at Quilt Data have built. Quilt is an open source platform for building a sane workflow around your data that works for your whole team, including version history, metatdata management, and flexible hosting. Stop by their booth at JupyterCon in New York City on August 22nd through the 24th to say Hi and tell them that the Data Engineering Podcast sent you! After that, keep an eye on the AWS marketplace for a pre-packaged version of Quilt for Teams to deploy into your own environment and stop fighting with your data.\nPython has quickly become one of the most widely used languages by both data engineers and data scientists, letting everyone on your team understand each other more easily. However, it can be tough learning it when you’re just starting out. Luckily, there’s an easy way to get involved. Written by MIT lecturer Ana Bell and published by Manning Publications, Get Programming: Learn to code with Python is the perfect way to get started working with Python. Ana’s experience\nas a teacher of Python really shines through, as you get hands-on with the language without being drowned in confusing jargon or theory. Filled with practical examples and step-by-step lessons to take on, Get Programming is perfect for people who just want to get stuck in with Python. 
Get your copy of the book with a special 40% discount for Data Engineering Podcast listeners by going to dataengineeringpodcast.com/get-programming and use the discount code PodInit40!\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Manish Jain about DGraph, a low latency, high throughput, native and distributed graph database.\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is DGraph and what motivated you to build it?\nGraph databases and graph algorithms have been part of the computing landscape for decades. What has changed in recent years to allow for the current proliferation of graph oriented storage systems?\n\nThe graph space has become crowded in recent years. How does DGraph compare to the current set of offerings?\n\n\n\nWhat are some of the common uses of graph storage systems?\n\n\nWhat are some potential uses that are often overlooked?\n\n\n\nThere are a few ways that graph structures and properties can be implemented, including the ability to store data in the vertices connecting nodes and the structures that can be contained within the nodes themselves. How is information represented in DGraph and what are the tradeoffs in the approach that you chose?\nHow do the query interface and data storage in DGraph differ from other options?\n\n\nWhat are your opinions on the graph query languages that have been adopted by other storage systems, such as Gremlin, Cypher, and GSQL?\n\n\n\nHow is DGraph architected and how has that architecture evolved from when it first started?\nHow do you balance the speed and agility of schema on read with the additional application complexity that is required, as opposed to schema on write?\nIn your documentation you contend that DGraph is a viable replacement for RDBMS-oriented primary storage systems. What are the switching costs for someone looking to make that transition?\nWhat are the limitations of DGraph in terms of scalability or usability?\nWhere does it fall along the axes of the CAP theorem?\nFor someone who is interested in building on top of DGraph and deploying it to production, what does their workflow and operational overhead look like?\nWhat have been the most challenging aspects of building and growing the DGraph project and community?\nWhat are some of the most interesting or unexpected uses of DGraph that you are aware of?\nWhen is DGraph the wrong choice?\nWhat are your plans for the future of DGraph?\n\n\nContact Info\n\n\n@manishrjain on Twitter\nmanishrjain on GitHub\nBlog\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nDGraph\nBadger\nGoogle Knowledge Graph\nGraph Theory\nGraph Database\nSQL\nRelational Database\nNoSQL\nOLTP (On-Line Transaction Processing)\nNeo4J\nPostgreSQL\nMySQL\nBigTable\nRecommendation System\nFraud Detection\nCustomer 360\nUsenet Express\nIPFS\nGremlin\nCypher\nGSQL\nGraphQL\nMetaWeb\nRAFT\nSpanner\nHBase\nElasticsearch\nKubernetes\nTLS (Transport Layer Security)\nJepsen Tests\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"The way that you store your data can have a huge impact on the ways that it can be practically used. 
For a substantial number of use cases, the optimal format for storing and querying that information is as a graph; however, databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"DGraph: A Fast, Distributed, Transactional Graph Database Built For Scale (Interview)","date_published":"2018-08-19T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/73c2b2c4-e0c4-42df-bc6d-8b40ee71b02a.mp3","mime_type":"audio/mpeg","size_in_bytes":29522736,"duration_in_seconds":2559}]},{"id":"podlove-2018-08-12t22:06:00+00:00-ed1aaaac65a74f3","title":"Putting Airflow Into Production With James Meickle - Episode 43","url":"https://www.dataengineeringpodcast.com/airflow-in-production-with-james-meickle-episode-43","content_text":"Summary\n\nThe theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat was your initial project requirement?\n\nWhat tooling did you consider in addition to Airflow?\nWhat aspects of the Airflow platform led you to choose it as your implementation target?\n\n\n\nCan you describe your current deployment architecture?\n\n\nHow many engineers are involved in writing tasks for your Airflow installation?\n\n\n\nWhat resources were the most helpful while learning about Airflow design patterns?\n\n\nHow have you architected your DAGs for deployment and extensibility?\n\n\n\nWhat kinds of tests and automation have you put in place to support the ongoing stability of your deployment?\nWhat are some of the dead-ends or other pitfalls that you encountered during the course of this project?\nWhat aspects of Airflow have you found to be lacking that you would like to see improved?\nWhat did you wish someone had told you before you started work on your Airflow installation?\n\n\nIf you were to start over would you make the same choice?\nIf Airflow wasn’t available what would be your second choice?\n\n\n\nWhat are your next steps for improvements and fixes?\n\n\nContact Info\n\n\n@eronarn on Twitter\nWebsite\neronarn on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nQuantopian\nHarvard Brain 
Science Initiative\nDevOps Days Boston\nGoogle Maps API\nCron\nETL (Extract, Transform, Load)\nAzkaban\nLuigi\nAWS Glue\nAirflow\nPachyderm\n\nPodcast Interview\n\n\n\nAirBnB\nPython\nYAML\nAnsible\nREST (Representational State Transfer)\nSAML (Security Assertion Markup Language)\nRBAC (Role-Based Access Control)\nMaxime Beauchemin\n\n\nMedium Blog\n\n\n\nCelery\nDask\n\n\nPodcast Interview\n\n\n\nPostgreSQL\n\n\nPodcast Interview\n\n\n\nRedis\nCloudformation\nJupyter Notebook\nQubole\nAstronomer\n\n\nPodcast Interview\n\n\n\nGunicorn\nKubernetes\nAirflow Improvement Proposals\nPython Enhancement Proposals (PEP)\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Lessons Learned While Building A Data Science Platform With Airflow (Interview)","date_published":"2018-08-12T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5bc29d0b-9359-4d3a-830e-c33b5d368978.mp3","mime_type":"audio/mpeg","size_in_bytes":42226599,"duration_in_seconds":2885}]},{"id":"podlove-2018-08-06t05:28:43+00:00-074e03958020c5e","title":"Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42","url":"https://www.dataengineeringpodcast.com/postgresql-with-jonathan-katz-episode-42","content_text":"Summary\n\nOne of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nAre you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. 
After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Jonathan Katz about a high level view of PostgreSQL and the unique capabilities that it offers\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nHow did you get involved in the Postgres project?\nFor anyone who hasn’t used it, can you describe what PostgreSQL is?\n\nWhere did Postgres get started and how has it evolved over the intervening years?\n\n\n\nWhat are some of the primary characteristics of Postgres that would lead someone to choose it for a given project?\n\n\nWhat are some cases where Postgres is the wrong choice?\n\n\n\nWhat are some of the common points of confusion for new users of PostgreSQL? (particularly if they have prior database experience)\nThe recent releases of Postgres have had some fairly substantial improvements and new features. How does the community manage to balance stability and reliability against the need to add new capabilities?\nWhat are the aspects of Postgres that allow it to remain relevant in the current landscape of rapid evolution at the data layer?\nAre there any plans to incorporate a distributed transaction layer into the core of the project along the lines of what has been done with Citus or CockroachDB?\nWhat is in store for the future of Postgres?\n\n\nContact Info\n\n\n@jkatz05 on Twitter\njkatz on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nPostgreSQL\nCrunchy Data\nVenuebook\nPaperless Post\nLAMP Stack\nMySQL\nPHP\nSQL\nORDBMS\nEdgar Codd\nA Relational Model of Data for Large Shared Data Banks\nRelational Algebra\nOracle DB\nUC Berkeley\nDr. Michael Stonebraker\nIngres\nInformix\nQUEL\nANSI C\nCVS\nBSD License\nUUID\nJSON\nXML\nHStore\nPostGIS\nBTree Index\nGIN Index\nGIST Index\nKNN GIST\nSPGIST\nFull Text Search\nBRIN Index\nWAL (Write-Ahead Log)\nSQLite\nPGAdmin\nVim\nEmacs\nLinux\nOLAP (Online Analytical Processing)\nPostgres IRC\nPostgres Slack\nPostgres Conferences\nUPSERT\nPostgres Roadmap\nCockroachDB\n\nPodcast Interview\n\n\n\nCitus Data\n\n\nPodcast Interview\n\n\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"A Whirlwind Tour Of The PostgreSQL Database (Interview)","date_published":"2018-08-06T01:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/725398d5-15ad-41bc-ba91-81b04b477869.mp3","mime_type":"audio/mpeg","size_in_bytes":60971550,"duration_in_seconds":3381}]},{"id":"podlove-2018-07-30t15:51:40+00:00-1181d5a60ee076c","title":"Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41","url":"https://www.dataengineeringpodcast.com/canopy-and-ona-with-peter-lubell-doughtie-episode-41","content_text":"Summary\n\nWith the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nAre you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. 
After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is Ona and how did the company get started?\n\nWhat are some examples of the types of customers that you work with?\n\n\n\nWhat types of data do you support in your collection platform?\nWhat are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users?\nDoes your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization?\nWhat are some of the integration challenges that are unique to the types of data that get collected by mobile field workers?\nCan you describe the flow of the data from collection through to analysis?\nTo help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?\n\n\nWhat are the architectural considerations that you factored in when designing it?\nWhat have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?\n\n\n\nWhat are your plans for the future of Ona and Canopy?\n\n\nContact Info\n\n\nEmail\npld on Github\nWebsite\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nOpenSRP\nOna\nCanopy\nOpen Data Kit\nEarth Institute at Columbia University\nSustainable Engineering Lab\nWHO\nBill and Melinda Gates Foundation\nXLSForms\nPostGIS\nKafka\nDruid\nSuperset\nPostgres\nAnsible\nDocker\nTerraform\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Collecting And Analysing Data At Human Scale With Ona And Canopy (Interview)","date_published":"2018-07-29T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/59df7eb3-a9e8-4f88-a85b-46e9186a9321.mp3","mime_type":"audio/mpeg","size_in_bytes":29581328,"duration_in_seconds":1754}]},{"id":"podlove-2018-07-16t02:12:22+00:00-f6728943f647dd7","title":"Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40","url":"https://www.dataengineeringpodcast.com/ceph-with-sage-weil-episode-40","content_text":"Summary\n\nWhen working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of the project, discusses how it got started, how it works, and how you can start using it on your infrastructure today. He also explains where it fits in the current landscape of distributed storage and the plans for future improvements.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nAre you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nJoin the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat\nYour host is Tobias Macey and today I’m interviewing Sage Weil about Ceph, an open source distributed file system that supports block storage, object storage, and a file system interface.\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start with an overview of what Ceph is?\n\nWhat was the motivation for starting the project?\nWhat are some of the most common use cases for Ceph?\n\n\n\nThere are a large variety of distributed file systems. How would you characterize Ceph as it compares to other options (e.g. 
HDFS, GlusterFS, LionFS, SeaweedFS, etc.)?\nGiven that there is no single point of failure, what mechanisms do you use to mitigate the impact of network partitions?\n\n\nWhat mechanisms are available to ensure data integrity across the cluster?\n\n\n\nHow is Ceph implemented and how has the design evolved over time?\nWhat is required to deploy and manage a Ceph cluster?\n\n\nWhat are the scaling factors for a cluster?\nWhat are the limitations?\n\n\n\nHow does Ceph handle mixed write workloads with either a high volume of small files or a smaller volume of larger files?\nIn services such as S3 the data is segregated from block storage options like EBS or EFS. Since Ceph provides all of those interfaces in one project is it possible to use each of those interfaces to the same data objects in a Ceph cluster?\nIn what situations would you advise someone against using Ceph?\nWhat are some of the most interesting, unexpected, or challenging aspects of working with Ceph and the community?\nWhat are some of the plans that you have for the future of Ceph?\n\n\nContact Info\n\n\nEmail\n@liewegas on Twitter\nliewegas on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nCeph\nRed Hat\nDreamHost\nUC Santa Cruz\nLos Alamos National Labs\nDream Objects\nOpenStack\nProxmox\nPOSIX\nGlusterFS\nHadoop\nCeph Architecture\nPaxos\nrelatime\nPrometheus\nZabbix\nKubernetes\nNVMe\nDNS-SD\nConsul\nEtcD\nDNS SRV Record\nZeroconf\nBluestore\nXFS\nErasure Coding\nNFS\nSeastar\nRook\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of the project, discusses how it got started, how it works, and how you can start using it on your infrastructure today. He also explains where it fits in the current landscape of distributed storage and the plans for future improvements.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Using Ceph For Highly Available, Scalable, And Flexible File Storage (Interview)","date_published":"2018-07-15T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/5e1d35b4-b3e4-4932-9743-1e808a7c4dc9.mp3","mime_type":"audio/mpeg","size_in_bytes":28307667,"duration_in_seconds":2910}]},{"id":"podlove-2018-07-08t21:41:04+00:00-1452c870dfe3f18","title":"Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39","url":"https://www.dataengineeringpodcast.com/nifi-with-kevin-doran-and-andy-lopresto-episode-39","content_text":"Summary\n\nData integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explained how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nAre you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. 
After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Kevin Doran and Andy LoPresto about Apache NiFi\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what NiFi is?\nWhat is the motivation for building a GUI as the primary interface for the tool when the current trend is to represent everything as code?\nHow did you get involved with the project?\n\nWhere does it sit in the broader landscape of data tools?\n\n\n\nDoes the data that is processed by NiFi flow through the servers that it is running on (à la Spark/Flink/Kafka), or does it orchestrate actions on other systems (à la Airflow/Oozie)?\n\n\nHow do you manage versioning and backup of data flows, as well as promoting them between environments?\n\n\n\nOne of the advertised features is tracking provenance for data flows that are managed by NiFi. How is that data collected and managed?\n\n\nWhat types of reporting are available across this information?\n\n\n\nWhat are some of the use cases or requirements that lend themselves well to being solved by NiFi?\n\n\nWhen is NiFi the wrong choice?\n\n\n\nWhat is involved in deploying and scaling a NiFi installation?\n\n\nWhat are some of the system/network parameters that should be considered?\nWhat are the scaling limitations?\n\n\n\nWhat have you found to be some of the most interesting, unexpected, and/or challenging aspects of building and maintaining the NiFi project and community?\nWhat do you have planned for the future of NiFi?\n\n\nContact Info\n\n\nKevin Doran\n\n@kevdoran on Twitter\nEmail\n\n\n\nAndy LoPresto\n\n\n@yolopey on Twitter\nEmail\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nNiFi\nHortonWorks DataFlow\nHortonWorks\nApache Software Foundation\nApple\nCSV\nXML\nJSON\nPerl\nPython\nInternet Scale\nAsset Management\nDocumentum\nDataFlow\nNSA (National Security Agency)\n24 (TV Show)\nTechnology Transfer Program\nAgile Software Development\nWaterfall\nSpark\nFlink\nKafka\nOozie\nLuigi\nAirflow\nFluentD\nETL (Extract, Transform, and Load)\nESB (Enterprise Service Bus)\nMiNiFi\nJava\nC++\nProvenance\nKubernetes\nApache Atlas\nData Governance\nKibana\nK-Nearest Neighbors\nDevOps\nDSL (Domain Specific Language)\nNiFi Registry\nArtifact Repository\nNexus\nNiFi CLI\nMaven Archetype\nIoT\nDocker\nBackpressure\nNiFi Wiki\nTLS (Transport Layer Security)\nMozilla TLS Observatory\nNiFi Flow Design System\nData Lineage\nGDPR (General Data Protection Regulation)\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. 
In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explain how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Self Service Data Flows With Apache NiFi (Interview)","date_published":"2018-07-08T10:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6eebfae3-4151-4d16-90ae-6154bbc0c5dd.mp3","mime_type":"audio/mpeg","size_in_bytes":48480013,"duration_in_seconds":3855}]},{"id":"podlove-2018-07-02t05:03:53+00:00-3b5fa253c04d24f","title":"Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38","url":"https://www.dataengineeringpodcast.com/alegion-with-cheryl-martin-episode-38","content_text":"Summary\n\nData is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in the data preparation process, and the challenges inherent to such an endeavor.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nAre you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. 
After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Cheryl Martin, chief data scientist at Alegion, about data labelling at scale\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nTo start, can you explain the problem space that Alegion is targeting and how you operate?\nWhen is it necessary to include human intelligence as part of the data lifecycle for ML/AI projects?\nWhat are some of the biggest challenges associated with managing human input to data sets intended for machine usage?\nFor someone who is acting as human-intelligence provider as part of the workforce, what does their workflow look like?\n\nWhat tools and processes do you have in place to ensure the accuracy of their inputs?\nHow do you prevent bad actors from contributing data that would compromise the trained model?\n\n\n\nWhat are the limitations of crowd-sourced data labels?\n\n\nWhen is it beneficial to incorporate domain experts in the process?\n\n\n\nWhen doing data collection from various sources, how do you ensure that intellectual property rights are respected?\nHow do you determine the taxonomies to be used for structuring data sets that are collected, labeled or enriched for your customers?\n\n\nWhat kinds of metadata do you track and how is that recorded/transmitted?\n\n\n\nDo you think that human intelligence will be a necessary piece of ML/AI forever?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nAlegion\nUniversity of Texas at Austin\nCognitive Science\nLabeled Data\nMechanical Turk\nComputer Vision\nSentiment Analysis\nSpeech Recognition\nTaxonomy\nFeature Engineering\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in the data preparation process, and the challenges inherent to such an endeavor.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Integrating Crowd Scale Human Intelligence In AI Projects (Interview)","date_published":"2018-07-02T01:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/112c8195-4e09-440f-92b2-7901190b2536.mp3","mime_type":"audio/mpeg","size_in_bytes":26961772,"duration_in_seconds":2773}]},{"id":"podlove-2018-06-25t02:26:24+00:00-8af85282e6849b0","title":"Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37","url":"https://www.dataengineeringpodcast.com/quilt-data-with-kevin-moore-episode-37","content_text":"Summary\n\nCollaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nAre you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. 
After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is the intended use case for Quilt and how did the project get started?\nCan you step through a typical workflow of someone using Quilt?\n\nHow does that change as you go from a single user to a team of data engineers and data scientists?\n\n\n\nCan you describe the elements of what a data package consists of?\n\n\nWhat were your criteria for the file formats that you chose?\n\n\n\nHow is Quilt architected and what have been the most significant changes or evolutions since you first started?\nHow is the data registry implemented?\n\n\nWhat are the limitations or edge cases that you have run into?\nWhat optimizations have you made to accelerate synchronization of the data to and from the repository?\n\n\n\nWhat are the limitations in terms of data volume, format, or usage?\nWhat is your goal with the business that you have built around the project?\nWhat are your plans for the future of Quilt?\n\n\nContact Info\n\n\nEmail\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nQuilt Data\nGitHub\nJobs\nReproducible Data Dependencies in Jupyter\nReproducible Machine Learning with Jupyter and Quilt\nAllen Institute: Programmatic Data Access with Quilt\nQuilt Example: MissingNo\nOracle\nPandas\nJupyter\nYcombinator\nData.World\n\nPodcast Episode with CTO Bryon Jacob\n\n\n\nKaggle\nParquet\nHDF5\nArrow\nPySpark\nExcel\nScala\nBinder\nMerkle Tree\nAllen Institute for Cell Science\nFlask\nPostgreSQL\nDocker\nAirflow\nQuilt Teams\nHive\nHive Metastore\nPrestoDB\n\n\nPodcast Episode\n\n\n\nNetflix Iceberg\nKubernetes\nHelm\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Quilt: The Package Manager And Repository For Your Data (Interview)","date_published":"2018-06-24T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/12fffccf-8bb9-4892-b07d-a243ea3264bd.mp3","mime_type":"audio/mpeg","size_in_bytes":22428631,"duration_in_seconds":2503}]},{"id":"podlove-2018-06-17t14:00:33+00:00-e451c6b0bdd6f51","title":"User Analytics In Depth At Heap with Dan Robinson - Episode 36","url":"https://www.dataengineeringpodcast.com/heap-with-dan-robinson-episode-36","content_text":"Summary\n\nWeb and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven’t been tracking a key interaction, having to write custom logic to add that event, and then waiting to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by giving a brief overview of Heap?\nOne of your differentiating features is the fact that you capture every interaction on web and mobile platforms for your customers. How do you prevent the user experience from suffering as a result of network congestion, while ensuring the reliable delivery of that data?\nCan you walk through the lifecycle of a single event from source to destination and the infrastructure components that it traverses to get there?\nData collected in a user’s browser can often be messy due to various browser plugins, variations in runtime capabilities, etc. 
How do you ensure the integrity and accuracy of that information?\n\nWhat are some of the difficulties that you have faced in establishing a representation of events that allows for uniform processing and storage?\n\n\n\nWhat is your approach for merging and enriching event data with the information that you retrieve from your supported integrations?\n\n\nWhat challenges does that pose in your processing architecture?\n\n\n\nWhat are some of the problems that you have had to deal with to allow for processing and storing such large volumes of data?\n\n\nHow has that architecture changed or evolved over the life of the company?\nWhat are some changes that you are anticipating in the near future?\n\n\n\nCan you describe your approach for synchronizing customer data with their individual Redshift instances and the difficulties that entails?\nWhat are some of the most interesting challenges that you have faced while building the technical and business aspects of Heap?\nWhat changes have been necessary as a result of GDPR?\nWhat are your plans for the future of Heap?\n\n\nContact Info\n\n\n\n@danlovesproofs on twitter\ndan@drob.us\n@drob on github\nheapanalytics.com / @heap on twitter\nhttps://heapanalytics.com/blog/category/engineering?utm_source=rss&utm_medium=rss\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nHeap\nPalantir\nUser Analytics\nGoogle Analytics\nPiwik\nMixpanel\nHubspot\nJepsen\nChaos Engineering\nNode.js\nKafka\nScala\nCitus\nReact\nMobX\nRedshift\nHeap SQL\nBigQuery\nWebhooks\nDrip\nData Virtualization\nDNS\nPII\nSOC2\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven’t been tracking a key interaction, having to write custom logic to add that event, and then waiting to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Heap's Data Infrastructure In Depth (Interview)","date_published":"2018-06-17T10:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/e1800a0f-88cf-46a4-8e09-4fb7a55745d8.mp3","mime_type":"audio/mpeg","size_in_bytes":34314477,"duration_in_seconds":2727}]},{"id":"podlove-2018-06-11t01:48:37+00:00-fa0e6dedf43d653","title":"CockroachDB In Depth with Peter Mattis - Episode 35","url":"https://www.dataengineeringpodcast.com/cockroachdb-with-peter-mattis-episode-35","content_text":"Summary\n\nWith the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in Cockroach DB. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Peter Mattis about CockroachDB, the SQL database for global cloud services\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat was the motivation for creating CockroachDB and building a business around it?\nCan you describe the architecture of CockroachDB and how it supports distributed ACID transactions?\n\nWhat are some of the tradeoffs that are necessary to allow for georeplicated data with distributed transactions?\nWhat are some of the problems that you have had to work around in the RAFT protocol to provide reliable operation of the clustering mechanism?\n\n\n\nGo is an unconventional language for building a database. 
What are the pros and cons of that choice?\nWhat are some of the common points of confusion that users of CockroachDB have when operating or interacting with it?\n\n\nWhat are the edge cases and failure modes that users should be aware of?\n\n\n\nI know that your SQL syntax is PostGreSQL compatible, so is it possible to use existing ORMs unmodified with CockroachDB?\n\n\nWhat are some examples of extensions that are specific to CockroachDB?\n\n\n\nWhat are some of the most interesting uses of CockroachDB that you have seen?\nWhen is CockroachDB the wrong choice?\nWhat do you have planned for the future of CockroachDB?\n\n\nContact Info\n\n\nPeter\n\nLinkedIn\npetermattis on GitHub\n@petermattis on Twitter\n\n\n\nCockroach Labs\n\n\n@CockroachDB on Twitter\nWebsite\ncockroachdb on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nCockroachDB\nCockroach Labs\nSQL\nGoogle Bigtable\nSpanner\nNoSQL\nRDBMS (Relational Database Management System)\n“Big Iron” (colloquial term for mainframe computers)\nRAFT Consensus Algorithm\nConsensus\nMVCC (Multiversion Concurrency Control)\nIsolation\nEtcd\nGDPR\nGolang\nC++\nGarbage Collection\nMetaprogramming\nRust\nStatic Linking\nDocker\nKubernetes\nCAP Theorem\nPostGreSQL\nORM (Object Relational Mapping)\nInformation Schema\nPG Catalog\nInterleaved Tables\nVertica\nSpark\nChange Data Capture\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in Cockroach DB. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"","date_published":"2018-06-10T22:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/28ec5ffc-e2e6-47e8-b2c9-6fdd5ab81faa.mp3","mime_type":"audio/mpeg","size_in_bytes":30262358,"duration_in_seconds":2621}]},{"id":"podlove-2018-06-04t03:22:01+00:00-3094d37fdff0e40","title":"ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34","url":"https://www.dataengineeringpodcast.com/arangodb-fast-scalable-and-multi-model-data-storage-with-jan-steeman-and-jan-stucke-episode-34","content_text":"Summary\n\nUsing a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, dey/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steeman and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Jan Stücke and Jan Steeman about ArangoDB, a multi-model distributed database for graph, document, and key/value storage.\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you give a high level description of what ArangoDB is and the motivation for creating it?\n\nWhat is the story behind the name?\n\n\n\nHow is ArangoDB constructed?\n\n\nHow does the underlying engine store the data to allow for the different ways of viewing it?\n\n\n\nWhat are some of the benefits of multi-model data storage?\n\n\nWhen does it become problematic?\n\n\n\nFor users who are accustomed to a relational engine, how do they need to adjust their approach to data modeling when working with Arango?\nHow does it compare to OrientDB?\nWhat are the options for scaling a running system?\n\n\nWhat are the limitations in terms of network architecture or data volumes?\n\n\n\nOne of the unique aspects of ArangoDB is the Foxx framework for embedding microservices in the data layer. 
What benefits does that provide over a three tier architecture?\n\n\nWhat mechanisms do you have in place to prevent data breaches from security vulnerabilities in the Foxx code?\nWhat are some of the most interesting or surprising uses of this functionality that you have seen?\n\n\n\nWhat are some of the most challenging technical and business aspects of building and promoting ArangoDB?\nWhat do you have planned for the future of ArangoDB?\n\n\nContact Info\n\n\nJan Steemann\n\njsteemann on GitHub\n@steemann on Twitter\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nArangoDB\nKöln\nMulti-model Database\nGraph Algorithms\nApache 2\nC++\nArangoDB Foxx\nRaft Protocol\nTarget Partners\nRocksDB\nAQL (ArangoDB Query Language)\nOrientDB\nPostGreSQL\nOrientDB Studio\nGoogle Spanner\n3-Tier Architecture\nThomson-Reuters\nArango Search\nDell EMC\nGoogle S2 Index\nArangoDB Geographic Functionality\nJSON Schema\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steemann and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Fast, Scalable, and Flexible Data Storage with ArangoDB (Interview)","date_published":"2018-06-03T23:30:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/171c3a59-3f12-4817-bfb8-dabfd676450f.mp3","mime_type":"audio/mpeg","size_in_bytes":30669129,"duration_in_seconds":2405}]},{"id":"podlove-2018-05-27t10:26:45+00:00-c4f1db2e56df9d0","title":"The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33","url":"https://www.dataengineeringpodcast.com/alooma-with-yair-weinberger-episode-33","content_text":"Summary\n\nBuilding an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. 
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Yair Weinberger about Alooma, a company providing data pipelines as a service\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is Alooma and what is the origin story?\nHow is the Alooma platform architected?\n\nI want to go into stream VS batch here\nWhat are the most challenging components to scale?\n\n\n\nHow do you manage the underlying infrastructure to support your SLA of 5 nines?\nWhat are some of the complexities introduced by processing data from multiple customers with various compliance requirements?\n\n\nHow do you sandbox user’s processing code to avoid security exploits?\n\n\n\nWhat are some of the potential pitfalls for automatic schema management in the target database?\nGiven the large number of integrations, how do you maintain the\n\n\nWhat are some challenges when creating integrations, isn’t it simply conforming with an external API?\n\n\n\nFor someone getting started with Alooma what does the workflow look like?\nWhat are some of the most challenging aspects of building and maintaining Alooma?\nWhat are your plans for the future of Alooma?\n\n\nContact Info\n\n\nLinkedIn\n@yairwein on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nAlooma\nConvert Media\nData Integration\nESB (Enterprise Service Bus)\nTibco\nMulesoft\nETL (Extract, Transform, Load)\nInformatica\nMicrosoft SSIS\nOLAP Cube\nS3\nAzure Cloud Storage\nSnowflake DB\nRedshift\nBigQuery\nSalesforce\nHubspot\nZendesk\nSpark\nThe Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps\nRDBMS (Relational Database Management System)\nSaaS (Software as a Service)\nChange Data Capture\nKafka\nStorm\nGoogle Cloud PubSub\nAmazon Kinesis\nAlooma Code Engine\nZookeeper\nIdempotence\nKafka Streams\nKubernetes\nSOC2\nJython\nDocker\nPython\nJavascript\nRuby\nScala\nPII (Personally Identifiable Information)\nGDPR (General Data Protection Regulation)\nAmazon EMR (Elastic Map Reduce)\nSequoia Capital\nLightspeed Investors\nRedis\nAerospike\nCassandra\nMongoDB\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"The Alooma Cloud Data Pipeline Deep Dive (Interview)","date_published":"2018-05-27T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/3bb7d627-af1b-4547-8234-59ea5679e873.mp3","mime_type":"audio/mpeg","size_in_bytes":34743742,"duration_in_seconds":2870}]},{"id":"podlove-2018-05-21t00:07:21+00:00-52148b622ee85a5","title":"PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32","url":"https://www.dataengineeringpodcast.com/prestodb-at-starburst-data-with-kamil-bajda-pawlikowski-episode-32","content_text":"Summary\n\nMost businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Presto is?\n\nWhat are some of the common use cases and deployment patterns for Presto?\n\n\n\nHow does Presto compare to Drill or Impala?\nWhat is it about Presto that led you to building a business around it?\nWhat are some of the most challenging aspects of running and scaling Presto?\nFor someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?\n\n\nHow does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with?\n\n\n\nWhat are some cases in which Presto is not the right solution?\nWhat types of support have you found to be the most commonly requested?\nWhat are some of the types of tooling or improvements that you have made to Presto in your distribution?\n\n\nWhat are some of the notable changes that your team has contributed upstream to Presto?\n\n\n\n\n\nContact Info\n\n\nWebsite\nE-mail\nTwitter – @starburstdata\nTwitter – @prestodb\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nStarburst Data\nPresto\nHadapt\nHadoop\nHive\nTeradata\nPrestoCare\nCost Based Optimizer\nANSI SQL\nSpill To Disk\nTempto\nBenchto\nGeospatial 
Functions\nCassandra\nAccumulo\nKafka\nRedis\nPostGreSQL\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Analyzing Your Data Lake With PrestoDB (Interview)","date_published":"2018-05-20T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/f3317bc1-ff2f-487a-b061-812c877a36b9.mp3","mime_type":"audio/mpeg","size_in_bytes":26228639,"duration_in_seconds":2527}]},{"id":"podlove-2018-05-14t00:32:22+00:00-e74035130daddf1","title":"Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31","url":"https://www.dataengineeringpodcast.com/odsc-east-2018-part-2-episode-31","content_text":"Summary\n\nThe Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He dscribes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how graph databases differ from relational engines, where graph algorithms are useful, and how TigerGraph is built to alow for fast and scalable operation.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.\nYour host is Tobias Macey and last week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. In this second part you will hear from Andy Eschbacher of Carto about the challenges of managing geospatial data, as well as Todd Blaschka of TigerGraph about graph databases and how his company has managed to build a fast and scalable platform for graph storage and traversal.\n\n\nInterview\n\nAndy Eschbacher From Carto\n\n\nWhat are the challenges associated with storing geospatial data?\nWhat are some of the common misconceptions that people have about working with geospatial data?\n\n\nContact Info\n\n\nandy-esch on GitHub\n@MrEPhysics on Twitter\nWebsite\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nCarto\nGeospatial Analysis\nGeoJSON\n\n\nTodd Blaschka From TigerGraph\n\n\nWhat are graph databases and how do they differ from relational engines?\nWhat are some of the common difficulties that people have when deling with graph algorithms?\nHow does data modeling for graph databases differ from relational stores?\n\n\nContact Info\n\n\nLinkedIn\n@toddblaschka on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nTigerGraph\nGraph Databases\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"The Open Data Science Conference brings together a variety of data professionals each year in Boston. 
This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He describes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how graph databases differ from relational engines, where graph algorithms are useful, and how TigerGraph is built to allow for fast and scalable operation.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"A Brief Look At Geospatial Data And Graph Databases (Interview)","date_published":"2018-05-13T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/8f6d4025-42d4-4884-845e-2395ee6153ec.mp3","mime_type":"audio/mpeg","size_in_bytes":20368614,"duration_in_seconds":1565}]},{"id":"podlove-2018-05-07t01:39:15+00:00-a2cfe6962d8320d","title":"Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30","url":"https://www.dataengineeringpodcast.com/odsc-east-2018-part-1-episode-30","content_text":"Summary\n\nThe Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and this week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. 
Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.\n\n\nInterview\n\nAlan Anders from Applecart\n\n\nWhat are the challenges of gathering and processing data from multiple data sources and representing them in a unified manner for merging into single entities?\nWhat are the biggest technical hurdles at Applecart?\n\n\nContact Info\n\n\n@alanjanders on Twitter\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nSpark\nDataBricks\nDataBricks Delta\nApplecart\n\n\nStepan Pushkarev from Hydrosphere.io\n\n\nWhat is Hydrosphere.io?\nWhat metrics do you track to determine when a machine learning model is not producing an appropriate output?\nHow do you determine which data points to sample for retraining the model?\nHow does the role of a machine learning engineer differ from data engineers and data scientists?\n\n\nContact Info\n\n\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nHydrosphere\nMachine Learning Engineer\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Brief Conversations On Data Engineering From The Open Data Science Conference","date_published":"2018-05-06T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/d184cf3e-cf15-4a19-a888-2783ee196c21.mp3","mime_type":"audio/mpeg","size_in_bytes":25280186,"duration_in_seconds":1958}]},{"id":"podlove-2018-04-30t01:02:56+00:00-fa79234da01d7a6","title":"Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29","url":"https://www.dataengineeringpodcast.com/metabase-with-sameer-al-sakran-episode-29","content_text":"Summary\n\nBusiness Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organizations data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used to build and share useful reports, some of the useful features planned for future releases, and how to get it set up to start using it in your environment.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Sameer Al-Sakran about Metabase, a free and open source tool for self service business intelligence\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nThe current goal for most companies is to be “data driven”. 
How would you define that concept?\n\nHow does Metabase assist in that endeavor?\n\n\n\nWhat is the ratio of users that take advantage of the GUI query builder as opposed to writing raw SQL?\n\n\nWhat level of complexity is possible with the query builder?\n\n\n\nWhat have you found to be the typical use cases for Metabase in the context of an organization?\nHow do you manage scaling for large or complex queries?\nWhat was the motivation for using Clojure as the language for implementing Metabase?\nWhat is involved in adding support for a new data source?\nWhat are the differentiating features of Metabase that would lead someone to choose it for their organization?\nWhat have been the most challenging aspects of building and growing Metabase, both from a technical and business perspective?\nWhat do you have planned for the future of Metabase?\n\n\nContact Info\n\n\nSameer\n\nsalsakran on GitHub\n@sameer_alsakran on Twitter\nLinkedIn\n\n\n\nMetabase\n\n\nWebsite\n@metabase on Twitter\nmetabase on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nExpa\nMetabase\nBlackjet\nHadoop\nImeem\nMaslow’s Hierarchy of Data Needs\n2 Sided Marketplace\nHoneycomb Interview\nExcel\nTableau\nGo-JEK\nClojure\nReact\nPython\nScala\nJVM\nRedash\nHow To Lie With Data\nStripe\nBraintree Payments\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organization’s data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used to build and share useful reports, some of the useful features planned for future releases, and how to get it set up to start using it in your environment.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Self Service Business Intelligence For Everyone With Metabase (Interview)","date_published":"2018-04-29T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/712d9794-7d42-4251-9769-c55d5ee98c3e.mp3","mime_type":"audio/mpeg","size_in_bytes":34587427,"duration_in_seconds":2686}]},{"id":"podlove-2018-04-22t02:54:01+00:00-f106327a9fd1407","title":"Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28","url":"https://www.dataengineeringpodcast.com/octopai-with-amnon-drori-episode-28","content_text":"Summary\n\nThe information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden. In this episode Amnon Drori, CEO and co-founder of Octopai, discusses the business problems he witnessed that led him to starting the company, how their systems are able to provide valuable tools and insights, and the direction that their product will be taking in the future.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. 
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Amnon Drori about OctopAI and the benefits of metadata management\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is OctopAI and what was your motivation for founding it?\nWhat are some of the types of information that you classify and collect as metadata?\nCan you talk through the architecture of your platform?\nWhat are some of the challenges that are typically faced by metadata management systems?\nWhat is involved in deploying your metadata collection agents?\nOnce the metadata has been collected what are some of the ways in which it can be used?\nWhat mechanisms do you use to ensure that customer data is segregated?\n\nHow do you identify and handle sensitive information during the collection step?\n\n\n\nWhat are some of the most challenging aspects of your technical and business platforms that you have faced?\nWhat are some of the plans that you have for OctopAI going forward?\n\n\nContact Info\n\n\nAmnon\n\nLinkedIn\n@octopai_amnon on Twitter\n\n\n\nOctopAI\n\n\n@OctopaiBI on Twitter\nWebsite\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nOctopAI\nMetadata\nMetadata Management\nData Integrity\nCRM (Customer Relationship Management)\nERP (Enterprise Resource Planning)\nBusiness Intelligence\nETL (Extract, Transform, Load)\nInformatica\nSAP\nData Governance\nSSIS (SQL Server Integration Services)\nVertica\nAirflow\nLuigi\nOozie\nGDPR (General Data Protection Regulation)\nRoot Cause Analysis\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden. In this episode Amnon Drori, CEO and co-founder of Octopai, discusses the business problems he witnessed that led him to starting the company, how their systems are able to provide valuable tools and insights, and the direction that their product will be taking in the future.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Octopai Managed Metadata Service For Better Business Intelligence (Interview)","date_published":"2018-04-22T20:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/7e2a04d9-8dc1-4426-8d70-b4832d3568f4.mp3","mime_type":"audio/mpeg","size_in_bytes":23520376,"duration_in_seconds":2392}]},{"id":"podlove-2018-04-15t03:10:47+00:00-4eb35f774e35852","title":"Data Engineering Weekly with Joe Crobak - Episode 27","url":"https://www.dataengineeringpodcast.com/data-engineering-weekly-with-joe-crobak-episode-27","content_text":"Summary\n\nThe rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with researching the details of distributed systems and big data management for his work he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discusses his experiences working as a data engineer in industry and at the USDS, his motivations and methods for creating a newsleteter, and the insights that he has gleaned from it.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Joe Crobak about his work maintaining the Data Engineering Weekly newsletter, and the challenges of keeping up with the data engineering industry.\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat are some of the projects that you have been involved in that were most personally fulfilling?\n\nAs an engineer at the USDS working on the healthcare.gov and medicare systems, what were some of the approaches that you used to manage sensitive data?\nHealthcare.gov has a storied history, how did the systems for processing and managing the data get architected to handle the amount of load that it was subjected to?\n\n\n\nWhat was your motivation for starting a newsletter about the Hadoop space?\n\n\nCan you speak to your reasoning for the recent rebranding of the newsletter?\n\n\n\nHow much of the content that you surface in your newsletter is found during your day-to-day work, versus explicitly searching for it?\nAfter over 5 years of following the trends in data analytics and data infrastructure what are some of the most interesting or surprising developments?\n\n\nWhat have you found to be the fundamental skills or areas of experience that have maintained relevance as new technologies in data engineering have emerged?\n\n\n\nWhat is your workflow for finding and curating the content that goes into your newsletter?\nWhat is your personal algorithm for filtering which articles, tools, or commentary gets added to the final newsletter?\nHow has your experience managing the newsletter influenced 
your areas of focus in your work and vice-versa?\nWhat are your plans going forward?\n\n\nContact Info\n\n\nData Eng Weekly\nEmail\nTwitter – @joecrobak\nTwitter – @dataengweekly\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nUSDS\nNational Labs\nCray\nAmazon EMR (Elastic Map-Reduce)\nRecommendation Engine\nNetflix Prize\nHadoop\nCloudera\nPuppet\nhealthcare.gov\nMedicare\nQuality Payment Program\nHIPAA\nNIST (National Institute of Standards and Technology)\nPII (Personally Identifiable Information)\nThreat Modeling\nApache JBoss\nApache Web Server\nMarkLogic\nJMS (Java Message Service)\nLoad Balancer\nCOBOL\nHadoop Weekly\nData Engineering Weekly\nFoursquare\nNiFi\nKubernetes\nSpark\nFlink\nStream Processing\nDataStax\nRSS\nThe Flavors of Data Science and Engineering\nCQRS\nChange Data Capture\nJay Kreps\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with researching the details of distributed systems and big data management for his work he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discusses his experiences working as a data engineer in industry and at the USDS, his motivations and methods for creating a newsletter, and the insights that he has gleaned from it.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Keeping Up With The Data Engineering industry (Interview)","date_published":"2018-04-14T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c6ab869d-1981-4020-8553-15cdc7ebd1fb.mp3","mime_type":"audio/mpeg","size_in_bytes":29824154,"duration_in_seconds":2612}]},{"id":"podlove-2018-04-08t21:19:27+00:00-aa15772fb2a12ec","title":"Defining DataOps with Chris Bergh - Episode 26","url":"https://www.dataengineeringpodcast.com/datakitchen-dataops-with-chris-bergh-episode-26","content_text":"Summary\n\nManaging an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting practices and principles from lean manufacturing and agile software development, and the cross-functional collaboration, feedback loops, and focus on automation in the DevOps movement. In this episode Christopher Bergh discusses ways that you can start adding reliability and speed to your workflow to deliver results with confidence and consistency.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Christopher Bergh about DataKitchen and the rise of DataOps\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nHow do you define DataOps?\n\nHow does it compare to the practices encouraged by the DevOps movement?\nHow does it relate to or influence the role of a data engineer?\n\n\n\nHow does a DataOps oriented workflow differ from other existing approaches for building data platforms?\nOne of the aspects of DataOps that you call out is the practice of providing multiple environments to provide a platform for testing the various aspects of the analytics workflow in a non-production context. What are some of the techniques that are available for managing data in appropriate volumes across those deployments?\nThe practice of testing logic as code is fairly well understood and has a large set of existing tools. What have you found to be some of the most effective methods for testing data as it flows through a system?\nOne of the practices of DevOps is to create feedback loops that can be used to ensure that business needs are being met. 
What are the metrics that you track in your platform to define the value that is being created and how the various steps in the workflow are proceeding toward that goal?\n\n\nIn order to keep feedback loops fast it is necessary for tests to run quickly. How do you balance the need for larger quantities of data to be used for verifying scalability/performance against optimizing for cost and speed in non-production environments?\n\n\n\nHow does the DataKitchen platform simplify the process of operationalizing a data analytics workflow?\nAs the need for rapid iteration and deployment of systems to capture, store, process, and analyze data becomes more prevalent how do you foresee that feeding back into the ways that the landscape of data tools are designed and developed?\n\n\nContact Info\n\n\nLinkedIn\n@ChrisBergh on Twitter\nEmail\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nDataOps Manifesto\nDataKitchen\n2017: The Year Of DataOps\nAir Traffic Control\nChief Data Officer (CDO)\nGartner\nW. Edwards Deming\nDevOps\nTotal Quality Management (TQM)\nInformatica\nTalend\nAgile Development\nCattle Not Pets\nIDE (Integrated Development Environment)\nTableau\nDelphix\nDremio\nPachyderm\nContinuous Delivery by Jez Humble and Dave Farley\nSLAs (Service Level Agreements)\nXKCD Image Recognition Comic\nAirflow\nLuigi\nDataKitchen Documentation\nContinuous Integration\nContinuous Delivery\nDocker\nVersion Control\nGit\nLooker\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting practices and principles from lean manufacturing and agile software development, and the cross-functional collaboration, feedback loops, and focus on automation in the DevOps movement. In this episode Christopher Bergh discusses ways that you can start adding reliability and speed to your workflow to deliver results with confidence and consistency.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Building Better Analytics Using DataOps (Interview)","date_published":"2018-04-08T17:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/aad5622b-be29-4965-9048-e61c78d064dc.mp3","mime_type":"audio/mpeg","size_in_bytes":40484381,"duration_in_seconds":3270}]},{"id":"podlove-2018-04-01t13:04:23+00:00-798a607d34e4b0a","title":"ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25","url":"https://www.dataengineeringpodcast.com/threatstack-with-pete-cheslock-and-patrick-cable-episode-25","content_text":"Summary\n\nCloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all of the data that your servers generate and monitors for unexpected anomalies in behavior that would indicate a breach and notifies you in near-realtime. In this episode ThreatStack’s director of operations, Pete Cheslock, and senior infrastructure security engineer, Patrick Cable, discuss the data infrastructure that supports their platform, how they capture and process the data from client systems, and how that information can be used to keep your systems safe from attackers.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. 
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Pete Cheslock and Pat Cable about the data infrastructure and security controls at ThreatStack\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhy don’t you start by explaining what ThreatStack does?\n\nWhat was lacking in the existing options (services and self-hosted/open source) that ThreatStack solves for?\n\n\n\nCan you describe the type(s) of data that you collect and how it is structured?\nWhat is the high level data infrastructure that you use for ingesting, storing, and analyzing your customer data?\n\n\nHow do you ensure a consistent format of the information that you receive?\nHow do you ensure that the various pieces of your platform are deployed using the proper configurations and operating as intended?\nHow much configuration do you provide to the end user in terms of the captured data, such as sampling rate or additional context?\n\n\n\nI understand that your original architecture used RabbitMQ as your ingest mechanism, which you then migrated to Kafka. What was your initial motivation for that change?\n\n\nHow much of a benefit has that been in terms of overall complexity and cost (both time and infrastructure)?\n\n\n\nHow do you ensure the security and provenance of the data that you collect as it traverses your infrastructure?\nWhat are some of the most common vulnerabilities that you detect in your client’s infrastructure?\nFor someone who wants to start using ThreatStack, what does the setup process look like?\nWhat have you found to be the most challenging aspects of building and managing the data processes in your environment?\nWhat are some of the projects that you have planned to improve the capacity or capabilities of your infrastructure?\n\n\nContact Info\n\n\nPete Cheslock\n\n@petecheslock on Twitter\nWebsite\npetecheslock on GitHub\n\n\n\nPatrick Cable\n\n\n@patcable on Twitter\nWebsite\npatcable on GitHub\n\n\n\nThreatStack\n\n\nWebsite\n@threatstack on Twitter\nthreatstack on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nThreatStack\nSecDevOps\nSonian\nEC2\nSnort\nSnorby\nSuricata\nTripwire\nSyscall (System Call)\nAuditD\nCloudTrail\nNaxsi\nCloud Native\nFile Integrity Monitoring (FIM)\nAmazon Web Services (AWS)\nRabbitMQ\nZeroMQ\nKafka\nSpark\nSlack\nPagerDuty\nJSON\nMicroservices\nCassandra\nElasticSearch\nSensu\nService Discovery\nHoneypot\nKubernetes\nPostGreSQL\nDruid\nFlink\nLaunch Darkly\nChef\nConsul\nTerraform\nCloudFormation\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all of the data that your servers generate and monitors for unexpected anomalies in behavior that would indicate a breach and notifies you in near-realtime. 
In this episode ThreatStack’s director of operations, Pete Cheslock, and senior infrastructure security engineer, Patrick Cable, discuss the data infrastructure that supports their platform, how they capture and process the data from client systems, and how that information can be used to keep your systems safe from attackers.
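The interview touches on ThreatStack's migration of their ingest pipeline from RabbitMQ to Kafka. As a rough sketch of what publishing agent events into Kafka can look like, here is a minimal producer using the kafka-python package; the broker address, topic name, and event fields are illustrative and not ThreatStack's actual schema:

```python
# Minimal Kafka ingest sketch with kafka-python.
# Broker, topic, and event shape are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"host": "web-01", "syscall": "execve", "user": "deploy"}
producer.send("agent-events", value=event)  # asynchronous publish
producer.flush()  # block until the broker has acknowledged the batch
```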
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Using Anomaly Detection To Secure Your Cloud with ThreatStack (Interview)","date_published":"2018-04-01T16:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/216f38c6-3c2b-4af2-bc9b-217d614c5571.mp3","mime_type":"audio/mpeg","size_in_bytes":35018801,"duration_in_seconds":3112}]},{"id":"podlove-2018-03-25t17:38:23+00:00-9f1f29a905e9bbb","title":"MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24","url":"https://www.dataengineeringpodcast.com/marketstore-with-hitoshi-harada-and-christopher-ryan-episode-24","content_text":"Summary\n\nThe data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or timeseries databases. To make this information more manageable the team at Alapaca built a new data store specifically for retrieving and analyzing data generated by trading markets. In this episode Hitoshi Harada, the CTO of Alapaca, and Christopher Ryan, their lead software engineer, explain their motivation for building MarketStore, how it operates, and how it has helped to simplify their development workflows.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. 
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Christopher Ryan and Hitoshi Harada about MarketStore, a storage server for large volumes of financial timeseries data\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat was your motivation for creating MarketStore?\nWhat are the characteristics of financial time series data that make it challenging to manage?\nWhat are some of the workflows that MarketStore is used for at Alpaca and how were they managed before it was available?\nWith MarketStore’s data coming from multiple third party services, how are you managing to keep the DB up-to-date and in sync with those services?\n\nWhat is the worst case scenario if there is a total failure in the data store?\nWhat guards have you built to prevent such a situation from occurring?\n\n\n\nSince MarketStore is used for querying and analyzing data having to do with financial markets and there are potentially large quantities of money being staked on the results of that analysis, how do you ensure that the operations being performed in MarketStore are accurate and repeatable?\nWhat were the most challenging aspects of building MarketStore and integrating it into the rest of your systems?\nWhat was your motivation for open sourcing the code?\nWhat is the next planned major feature for MarketStore, and what use-case is it aiming to support?\n\n\nContact Info\n\n\nChristopher\n\nEmail\n\n\n\nHitoshi\n\n\nEmail\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nMarketStore\n\nGitHub\nRelease Announcement\n\n\n\nAlpaca\nIBM\nDB2\nGreenPlum\nAlgorithmic Trading\nBacktesting\nOHLC (Open-High-Low-Close)\nHDF5\nGolang\nC++\nTimeseries Database List\nInfluxDB\nJSONRPC\nSlait\nCircleCI\nGDAX\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"The data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or timeseries databases. To make this information more manageable the team at Alpaca built a new data store specifically for retrieving and analyzing data generated by trading markets. In this episode Hitoshi Harada, the CTO of Alpaca, and Christopher Ryan, their lead software engineer, explain their motivation for building MarketStore, how it operates, and how it has helped to simplify their development workflows.
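For a feel of how MarketStore is queried in practice, here is a short sketch using the pymarketstore Python client; it assumes a MarketStore server running on its default RPC endpoint, and the symbol, timeframe, and limit are arbitrary examples rather than anything from the episode:

```python
# Query sketch with the pymarketstore client (assumed installed and
# pointed at a local MarketStore server on its default RPC endpoint).
import pymarketstore as pymkts

client = pymkts.Client("http://localhost:5993/rpc")

# Buckets are addressed as symbol/timeframe/attribute-group.
params = pymkts.Params("AAPL", "1Min", "OHLCV", limit=10)
reply = client.query(params)

print(reply.first().df())  # the last ten one-minute OHLCV bars as a DataFrame
```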
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Fast and Scalable Financial Timeseries Dataframes with MarketStore (Interview)","date_published":"2018-03-25T15:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/1951913b-6d96-4887-a3c4-5ccd80284599.mp3","mime_type":"audio/mpeg","size_in_bytes":23889511,"duration_in_seconds":2007}]},{"id":"podlove-2018-03-19t01:27:20+00:00-ff36314798bd73d","title":"Stretching The Elastic Stack with Philipp Krenn - Episode 23","url":"https://www.dataengineeringpodcast.com/elastic-stack-with-philipp-krenn-episode-23","content_text":"Summary\n\nSearch is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to many more use cases in the proces. In this episode Philipp Krenn describes the various pieces of the stack, how they fit together, and how you can use them in your infrastructure to store, search, and analyze your data.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data management\nWhen you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.\nFor complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. 
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYour host is Tobias Macey and today I’m interviewing Philipp Krenn about the Elastic Stack and the ways that you can use it in your systems\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nThe Elasticsearch product has been around for a long time and is widely known, but can you give a brief overview of the other components that make up the Elastic Stack and how they work together?\nBeyond the common pattern of using Elasticsearch as a search engine connected to a web application, what are some of the other use cases for the various pieces of the stack?\nWhat are the common scaling bottlenecks that users should be aware of when they are dealing with large volumes of data?\nWhat do you consider to be the biggest competition to the Elastic Stack as you expand the capabilities and target usage patterns?\nWhat are the biggest challenges that you are tackling in the Elastic Stack, technical or otherwise?\nWhat are the biggest challenges facing Elastic as a company in the near to medium term?\nOpen source as a business model: https://www.elastic.co/blog/doubling-down-on-open?utm_source=rss&utm_medium=rss\nWhat is the vision for Elastic and the Elastic Stack going forward and what new features or functionality can we look forward to?\n\n\nContact Info\n\n\n@xeraa on Twitter\nxeraa on GitHub\nWebsite\nEmail\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nElastic\nVienna – Capital of Austria\nWhat Is Developer Advocacy?\nNoSQL\nMongoDB\nElasticsearch\nCassandra\nNeo4J\nHazelcast\nApache Lucene\nLogstash\nKibana\nBeats\nX-Pack\nELK Stack\nMetrics\nAPM (Application Performance Monitoring)\nGeoJSON\nSplit Brain\nElasticsearch Ingest Nodes\nPacketBeat\nElastic Cloud\nElasticon\nKibana Canvas\nSwiftType\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to many more use cases in the process. In this episode Philipp Krenn describes the various pieces of the stack, how they fit together, and how you can use them in your infrastructure to store, search, and analyze your data.
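As a minimal illustration of the index-and-search workflow at the heart of the stack, here is a sketch using the official elasticsearch Python client with 8.x-style keyword arguments; the endpoint, index name, and document are placeholders:

```python
# Index a document and search for it with the elasticsearch client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="articles", document={"title": "Stretching the Elastic Stack"})
es.indices.refresh(index="articles")  # make the new document searchable now

resp = es.search(index="articles", query={"match": {"title": "elastic"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```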
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Exploring The Elastic Stack: From Text Search To Metrics Platform (Interview)","date_published":"2018-03-18T21:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/920cad85-4c9c-4f4d-a341-6e9ec27edc0e.mp3","mime_type":"audio/mpeg","size_in_bytes":40162826,"duration_in_seconds":3062}]},{"id":"podlove-2018-03-12t04:04:07+00:00-776468806187efe","title":"Database Refactoring Patterns with Pramod Sadalage - Episode 22","url":"https://www.dataengineeringpodcast.com/database-refactoring-patterns-with-pramod-sadalage-episode-22","content_text":"Summary\n\nAs software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners, and in this episode he reflects on the current state of affairs and how things have changed over the past 12 years.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Pramod Sadalage about refactoring databases and integrating database design into an iterative development workflow\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nYou first co-authored Refactoring Databases in 2006. 
What was the state of software and database system development at the time and why did you find it necessary to write a book on this subject?\nWhat are the characteristics of a database that make it more difficult to manage in an iterative context?\nHow does the practice of refactoring in the context of a database compare to that of software?\nHow has the prevalence of data abstractions such as ORMs or ODMs impacted the practice of schema design and evolution?\nIs there a difference in strategy when refactoring the data layer of a system when using a non-relational storage system?\nHow has the DevOps movement and the increased focus on automation affected the state of the art in database versioning and evolution?\nWhat have you found to be the most problematic aspects of databases when trying to evolve the functionality of a system?\nLooking back over the past 12 years, what has changed in the areas of database design and evolution?\n\nHow has the landscape of tooling for managing and applying database versioning changed since you first wrote Refactoring Databases?\nWhat do you see as the biggest challenges facing us over the next few years?\n\n\n\n\n\nContact Info\n\n\nWebsite\npramodsadalage on GitHub\n@pramodsadalage on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nDatabase Refactoring\n\nWebsite\nBook\n\n\n\nThoughtworks\nMartin Fowler\nAgile Software Development\nXP (Extreme Programming)\nContinuous Integration\n\n\nThe Book\nWikipedia\n\n\n\nTest First Development\nDDL (Data Definition Language)\nDML (Data Manipulation Language)\nDevOps\nFlyway\nLiquibase\nDBMaintain\nHibernate\nSQLAlchemy\nORM (Object Relational Mapper)\nODM (Object Document Mapper)\nNoSQL\nDocument Database\nMongoDB\nOrientDB\nCouchBase\nCassandraDB\nNeo4j\nArangoDB\nUnit Testing\nIntegration Testing\nOLAP (On-Line Analytical Processing)\nOLTP (On-Line Transaction Processing)\nData Warehouse\nDocker\nQA (Quality Assurance)\nHIPAA (Health Insurance Portability and Accountability Act)\nPCI DSS (Payment Card Industry Data Security Standard)\nPolyglot Persistence\nToplink Java ORM\nRuby on Rails\nActiveRecord Gem\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners, and in this episode he reflects on the current state of affairs and how things have changed over the past 12 years.
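The version-controlled migration scripts described in the summary are the pattern that tools like Flyway and Liquibase (linked above) implement. As a sketch of the idea rather than an excerpt from the book, here is a hypothetical Alembic migration in Python, where each schema refactoring is a reversible, reviewable script kept in source control alongside the application code:

```python
# Hypothetical Alembic migration: one reversible schema refactoring,
# applied with `alembic upgrade head` and undone with `alembic downgrade`.
from alembic import op
import sqlalchemy as sa

# Revision identifiers normally generated by `alembic revision`.
revision = "a1b2c3d4e5f6"
down_revision = None

def upgrade():
    # Forward step: add a nullable column so existing readers keep working.
    op.add_column("account", sa.Column("last_login", sa.DateTime(), nullable=True))

def downgrade():
    # Reverse step: makes the refactoring safe to roll back.
    op.drop_column("account", "last_login")
```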
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Evolutionary Database Design and Refactoring (Interview)","date_published":"2018-03-12T00:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/a7d6c3b1-a674-43f8-962f-9fd91b944e64.mp3","mime_type":"audio/mpeg","size_in_bytes":35393524,"duration_in_seconds":2945}]},{"id":"podlove-2018-03-05t02:24:19+00:00-b345c9d66278a00","title":"The Future Data Economy with Roger Chen - Episode 21","url":"https://www.dataengineeringpodcast.com/data-economy-with-roger-chen-episode-21","content_text":"Summary\n\nData is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI which require large quantities of information to work from. As the demand for data becomes more widespread the market for providing it will begin transform the ways that information is collected and shared among and between organizations. With his experience as a chair for the O’Reilly AI conference and an investor for data driven businesses Roger Chen is well versed in the challenges and solutions being facing us. In this episode he shares his perspective on the ways that businesses can work together to create shared data resources that will allow them to reduce the redundancy of their foundational data and improve their overall effectiveness in collecting useful training sets for their particular products.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nA few announcements:\n\nThe O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%\nIf you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.\n\n\n\nYour host is Tobias Macey and today I’m interviewing Roger Chen about data liquidity and its impact on our future economies\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nYou wrote an essay discussing how the increasing usage of machine learning and artificial intelligence applications will result in a demand for data that necessitates what you refer to as ‘Data Liquidity’. 
Can you explain what you mean by that term?\nWhat are some examples of the types of data that you envision as being foundational to multiple organizations and problem domains?\nCan you provide some examples of the structures that could be created to facilitate data sharing across organizational boundaries?\nMany companies view their data as a strategic asset and are therefore loath to provide access to other individuals or organizations. What encouragement can you provide that would convince them to externalize any of that information?\nWhat kinds of storage and transmission infrastructure and tooling are necessary to allow for wider distribution of, and collaboration on, data assets?\nWhat do you view as being the privacy implications from creating and sharing these larger pools of data inventory?\nWhat do you view as some of the technical challenges associated with identifying and separating shared data from those that are specific to the business model of the organization?\nWith broader access to large data sets, how do you anticipate that impacting the types of businesses or products that are possible for smaller organizations?\n\n\nContact Info\n\n\n@rgrchen on Twitter\nLinkedIn\nAngel List\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nElectrical Engineering\nBerkeley\nSilicon Nanophotonics\nData Liquidity In The Age Of Inference\nData Silos\nExample of a Data Commons Cooperative\nGoogle Maps Moat: An article describing how Google Maps has refined raw data to create a new product\nGenomics\nPhenomics\nImageNet\nOpen Data\nData Brokerage\nSmart Contracts\nIPFS\nDat Protocol\nHomomorphic Encryption\nFileCoin\nData Programming\nSnorkel\n\nWebsite\nPodcast Interview\n\n\n\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Data is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI which require large quantities of information to work from. As the demand for data becomes more widespread the market for providing it will begin to transform the ways that information is collected and shared among and between organizations. With his experience as a chair for the O’Reilly AI conference and an investor for data driven businesses Roger Chen is well versed in the challenges and solutions facing us. In this episode he shares his perspective on the ways that businesses can work together to create shared data resources that will allow them to reduce the redundancy of their foundational data and improve their overall effectiveness in collecting useful training sets for their particular products.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Data Liquidity In The AI Economy (Interview)","date_published":"2018-03-04T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/872a6d49-12c4-40bd-b432-3de6551e1235.mp3","mime_type":"audio/mpeg","size_in_bytes":31343138,"duration_in_seconds":2567}]},{"id":"podlove-2018-02-26t03:57:29+00:00-32845a6df9b88aa","title":"Honeycomb Data Infrastructure with Sam Stokes - Episode 20","url":"https://www.dataengineeringpodcast.com/honeycomb-data-infrastructure-with-sam-stokes-episode-20","content_text":"Summary\n\nOne of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a successful system. Sam Stokes is an engineer at Honeycomb where he helps to build a platform that is able to capture all of the events and context that occur in our production environments and use them to answer all of your questions about what is happening in your system right now. In this episode he discusses the challenges inherent in capturing and analyzing event data, the tools that his team is using to make it possible, and how this type of knowledge can be used to improve your critical infrastructure.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nA few announcements:\n\nThere is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%\nThe O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%\nIf you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. 
To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.\n\n\n\nYour host is Tobias Macey and today I’m interviewing Sam Stokes about his work at Honeycomb, a modern platform for observability of software systems\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is Honeycomb and how did you get started at the company?\nCan you start by giving an overview of your data infrastructure and the path that an event takes from ingest to graph?\nWhat are the characteristics of the event data that you are dealing with and what challenges does it pose in terms of processing it at scale?\nIn addition to the complexities of ingesting and storing data with a high degree of cardinality, being able to quickly analyze it for customer reporting poses a number of difficulties. Can you explain how you have built your systems to facilitate highly interactive usage patterns?\nA high degree of visibility into a running system is desirable for developers and systems administrators, but they are not always willing or able to invest the effort to fully instrument the code or servers that they want to track. What have you found to be the most difficult aspects of data collection, and do you have any tooling to simplify the implementation for users?\nHow does Honeycomb compare to other systems that are available off the shelf or as a service, and when is it not the right tool?\nWhat have been some of the most challenging aspects of building, scaling, and marketing Honeycomb?\n\n\nContact Info\n\n\n@samstokes on Twitter\nBlog\nsamstokes on GitHub\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nHoneycomb\nRetriever\nMonitoring and Observability\nKafka\nColumn Oriented Storage\nElasticsearch\nElastic Stack\nDjango\nRuby on Rails\nHeroku\nKubernetes\nLaunch Darkly\nSplunk\nDatadog\nCynefin Framework\nGo-Lang\nTerraform\nAWS\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a successful system. Sam Stokes is an engineer at Honeycomb where he helps to build a platform that is able to capture all of the events and context that occur in our production environments and use them to answer all of your questions about what is happening in your system right now. In this episode he discusses the challenges inherent in capturing and analyzing event data, the tools that his team is using to make it possible, and how this type of knowledge can be used to improve your critical infrastructure.
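To ground the discussion of instrumenting code so that events carry rich context, here is a minimal sketch using Honeycomb's libhoney Python SDK; the write key, dataset name, and fields are placeholders, and real instrumentation would attach far more context per event:

```python
# Emit one wide, structured event to Honeycomb with libhoney.
# Write key, dataset, and field names are placeholders.
import time
import libhoney

libhoney.init(writekey="YOUR_WRITE_KEY", dataset="example-service")

start = time.time()
# ... handle a request here ...
ev = libhoney.new_event()
ev.add_field("endpoint", "/checkout")
ev.add_field("status_code", 200)
ev.add_field("duration_ms", (time.time() - start) * 1000)
ev.send()  # queued and delivered asynchronously

libhoney.close()  # flush any pending events before exit
```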
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Event Data Infrastructure at Honeycomb.io (Interview)","date_published":"2018-02-25T23:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/6956dc44-b20e-47de-815c-d1c7ef7b919c.mp3","mime_type":"audio/mpeg","size_in_bytes":31455429,"duration_in_seconds":2493}]},{"id":"podlove-2018-02-19t00:58:11+00:00-391ba93d4336b97","title":"Data Teams with Will McGinnis - Episode 19","url":"https://www.dataengineeringpodcast.com/data-teams-with-will-mcginnis-episode-19","content_text":"Summary\n\nThe responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis discusses the opinions that he has gained from experience on how data teams can play to their strengths to the benefit of all.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nA few announcements:\n\nThere is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%\nThe O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%\nIf you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.\n\n\n\nYour host is Tobias Macey and today I’m interviewing Will McGinnis about the relationship and boundaries between data engineers and data scientists\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nThe terms “Data Scientist” and “Data Engineer” are fluid and seem to have a different meaning for everyone who uses them. 
Can you share how you define those terms?\nWhat parallels do you see between the relationships of data engineers and data scientists and those of developers and systems administrators?\nIs there a particular size of organization or problem that serves as a tipping point for when you start to separate the two roles into the responsibilities of more than one person or team?\nWhat are the benefits of splitting the responsibilities of data engineering and data science?\n\nWhat are the disadvantages?\n\n\n\nWhat are some strategies to ensure successful interaction between data engineers and data scientists?\nHow do you view these roles evolving as they become more prevalent across companies and industries?\n\n\nContact Info\n\n\nWebsite\nwdm0006 on GitHub\n@willmcginniser on Twitter\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nBlog Post: Tendencies of Data Engineers and Data Scientists\nPredikto\nCategorical Encoders\nDevOps\nSciKit-Learn\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"The responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis discusses the opinions that he has gained from experience on how data teams can play to their strengths to the benefit of all.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"How Data Teams Work Together (Interview)","date_published":"2018-02-18T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/b0214a23-8a42-4afa-adcf-2678ac13c1d9.mp3","mime_type":"audio/mpeg","size_in_bytes":23195336,"duration_in_seconds":1718}]},{"id":"podlove-2018-02-11t15:57:44+00:00-4674d6e85b6a857","title":"TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18","url":"https://www.dataengineeringpodcast.com/timescaledb-with-ajay-kulkarni-and-mike-freedman-episode-18","content_text":"Summary\n\nAs communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Ajay Kulkarni and Mike Freedman about Timescale DB, a scalable timeseries database built on top of PostGreSQL\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Timescale is and how the project got started?\nThe landscape of time series databases is extensive and oftentimes difficult to navigate. How do you view your position in that market and what makes Timescale stand out from the other options?\nIn your blog post that explains the design decisions for how Timescale is implemented you call out the fact that the inserted data is largely append only which simplifies the index management. 
How does Timescale handle out of order timestamps, such as from infrequently connected sensors or mobile devices?\nHow is Timescale implemented and how has the internal architecture evolved since you first started working on it?\n\nWhat impact has the 10.0 release of PostGreSQL had on the design of the project?\nIs timescale compatible with systems such as Amazon RDS or Google Cloud SQL?\n\n\n\nFor someone who wants to start using Timescale what is involved in deploying and maintaining it?\nWhat are the axes for scaling Timescale and what are the points where that scalability breaks down?\n\n\nAre you aware of anyone who has deployed it on top of Citus for scaling horizontally across instances?\n\n\n\nWhat has been the most challenging aspect of building and marketing Timescale?\nWhen is Timescale the wrong tool to use for time series data?\nOne of the use cases that you call out on your website is for systems metrics and monitoring. How does Timescale fit into that ecosystem and can it be used along with tools such as Graphite or Prometheus?\nWhat are some of the most interesting uses of Timescale that you have seen?\nWhich came first, Timescale the business or Timescale the database, and what is your strategy for ensuring that the open source project and the company around it both maintain their health?\nWhat features or improvements do you have planned for future releases of Timescale?\n\n\nContact Info\n\n\nAjay\n\nLinkedIn\n@acoustik on Twitter\nTimescale Blog\n\n\n\nMike\n\n\nWebsite\nLinkedIn\n@michaelfreedman on Twitter\nTimescale Blog\n\n\n\nTimescale\n\n\nWebsite\n@timescaledb on Twitter\nGitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nTimescale\nPostGreSQL\nCitus\nTimescale Design Blog Post\nMIT\nNYU\nStanford\nSDN\nPrinceton\nMachine Data\nTimeseries Data\nList of Timeseries Databases\nNoSQL\nOnline Transaction Processing (OLTP)\nObject Relational Mapper (ORM)\nGrafana\nTableau\nKafka\nWhen Boring Is Awesome\nPostGreSQL\nRDS\nGoogle Cloud SQL\nAzure DB\nDocker\nContinuous Aggregates\nStreaming Replication\nPGPool II\nKubernetes\nDocker Swarm\nCitus Data\n\nWebsite\nData Engineering Podcast Interview\n\n\n\nDatabase Indexing\nB-Tree Index\nGIN Index\nGIST Index\nSTE Energy\nRedis\nGraphite\nPrometheus\npg_prometheus\nOpenMetrics Standard Proposal\nTimescale Parallel Copy\nHadoop\nPostGIS\nKDB+\nDevOps\nInternet of Things\nMongoDB\nElastic\nDataBricks\nApache Spark\nConfluent\nNew Enterprise Associates\nMapD\nBenchmark Ventures\nHortonworks\n2σ Ventures\nCockroachDB\nCloudflare\nEMC\nTimescale Blog: Why SQL is beating NoSQL, and what this means for the future of data\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.
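Because Timescale presents itself as plain PostGreSQL, trying it from Python requires nothing beyond a standard driver. The sketch below uses psycopg2 to create an ordinary table and convert it into a hypertable; the connection string, table, and columns are illustrative:

```python
# Create a TimescaleDB hypertable through a normal PostgreSQL driver.
# Connection parameters and schema are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS conditions (
            time        TIMESTAMPTZ NOT NULL,
            device_id   TEXT,
            temperature DOUBLE PRECISION
        )
    """)
    # Transparently partitions the table into time-based chunks.
    cur.execute(
        "SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE)"
    )
conn.close()
```

From here, inserts and queries are ordinary SQL, which is what lets existing tools such as ORMs and Grafana keep working unchanged.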
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"TimescaleDB: Fast and Scalable Timeseries On PostGreSQL (Interview)","date_published":"2018-02-11T11:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/78092640-4638-4558-bd2b-a64c02d37f3a.mp3","mime_type":"audio/mpeg","size_in_bytes":49150487,"duration_in_seconds":3760}]},{"id":"podlove-2018-02-04t03:19:42+00:00-7126415d91bf89d","title":"Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17","url":"https://www.dataengineeringpodcast.com/pulsar-fast-and-scalable-messaging-with-rajan-dhabalia-and-matteo-merli-episode-17","content_text":"Summary\n\nOne of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nA few announcements:\n\nThere is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%\nThe O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%\nIf you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. 
To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.\n\n\n\nYour host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you start by explaining what Pulsar is and what the original inspiration for the project was?\nWhat have been some of the most challenging aspects of building and promoting Pulsar?\nFor someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering and what is involved in deploying the various components?\nWhat are the scaling factors for Pulsar and what aspects of deployment and administration should users pay special attention to?\nWhat projects or services do you consider to be competitors to Pulsar and what makes it stand out in comparison?\nThe documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to also supporting some of the plugins that have developed on top of Kafka?\nOne of the popular aspects of Kafka is the persistence of the message log, so I’m curious how Pulsar manages long-term storage and reprocessing of messages that have already been acknowledged?\nWhen is Pulsar the wrong tool to use?\nWhat are some of the improvements or new features that you have planned for the future of Pulsar?\n\n\nContact Info\n\n\nMatteo\n\nmerlimat on GitHub\n@merlimat on Twitter\n\n\n\nRajan\n\n\n@dhabaliaraj on Twitter\nrhabalia on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nPulsar\nPublish-Subscribe\nYahoo\nStreamlio\nActiveMQ\nKafka\nBookkeeper\nSLA (Service Level Agreement)\nWrite-Ahead Log\nAnsible\nZookeeper\nPulsar Deployment Instructions\nRabbitMQ\nConfluent Schema Registry\n\nPodcast Interview\n\n\n\nKafka Connect\nWallaroo\n\n\nPodcast Interview\n\n\n\nKinesis\nAthenz\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.
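To make the pub-sub model concrete, here is a produce-and-consume round trip using the pulsar-client Python package; the broker URL, topic, and subscription name are placeholders for a local standalone broker:

```python
# Publish and consume one message with the pulsar-client package.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

producer = client.create_producer("persistent://public/default/demo")
producer.send("hello pulsar".encode("utf-8"))

consumer = client.subscribe(
    "persistent://public/default/demo", subscription_name="demo-sub"
)
msg = consumer.receive(timeout_millis=5000)
print(msg.data().decode("utf-8"))
consumer.acknowledge(msg)  # acknowledged messages can later be reclaimed from storage

client.close()
```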
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Fast, Globally Scalable Data Streaming with Pulsar (Interview)","date_published":"2018-02-03T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/290d7faf-865c-4170-a5c1-b0df56e2acc0.mp3","mime_type":"audio/mpeg","size_in_bytes":37072519,"duration_in_seconds":3226}]},{"id":"podlove-2018-01-29t02:19:14+00:00-646dfc6a5756548","title":"Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16","url":"https://www.dataengineeringpodcast.com/dat-with-danielle-robinson-and-joe-hand-episode-16","content_text":"Summary\nSharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases.\nPreamble\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nA few announcements:\n\nThere is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%\nThe O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%\nIf you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. 
To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.\n\n\nYour host is Tobias Macey and today I’m interviewing Danielle Robinson and Joe Hand about Dat Project, a distributed data sharing protocol for building applications of the future\n\nInterview\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is the Dat project and how did it get started?\nHow have the grants to the Dat project influenced the focus and pace of development that was possible?\n\nNow that you have established a non-profit organization around Dat, what are your plans to support future sustainability and growth of the project?\n\n\nCan you explain how the Dat protocol is designed and how it has evolved since it was first started?\nHow does Dat manage conflict resolution and data versioning when replicating between multiple machines?\nOne of the primary use cases that is mentioned in the documentation and website for Dat is that of hosting and distributing open data sets, with a focus on researchers. How does Dat help with that effort and what improvements does it offer over other existing solutions?\nOne of the difficult aspects of building a peer-to-peer protocol is that of establishing a critical mass of users to add value to the network. How have you approached that effort and how much progress do you feel that you have made?\nHow does the peer-to-peer nature of the platform affect the architectural patterns for people wanting to build applications that are delivered via dat, vs the common three-tier architecture oriented around persistent databases?\nWhat mechanisms are available for content discovery, given the fact that Dat URLs are private and unguessable by default?\nFor someone who wants to start using Dat today, what is involved in creating and/or consuming content that is available on the network?\nWhat have been the most challenging aspects of building and promoting Dat?\nWhat are some of the most interesting or inspiring uses of the Dat protocol that you are aware of?\n\nContact Info\n\nDat\n\ndatproject.org\nEmail\n@dat_project on Twitter\nDat Chat\n\n\nDanielle\n\nEmail\n@daniellecrobins\n\n\nJoe\n\nEmail\n@joeahand on Twitter\n\n\n\nParting Question\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\nLinks\n\nDat Project\nCode For Science and Society\nNeuroscience\nCell Biology\nOpenCon\nMozilla Science\nOpen Education\nOpen Access\nOpen Data\nFortune 500\nData Warehouse\nKnight Foundation\nAlfred P. Sloan Foundation\nGordon and Betty Moore Foundation\nDat In The Lab\nDat in the Lab blog posts\nCalifornia Digital Library\nIPFS\nDat on Open Collective – COMING SOON!\nScienceFair\nStencila\neLIFE\nGit\nBitTorrent\nDat Whitepaper\nMerkle Tree\nCertificate Transparency\nDat Protocol Working Group\nDat Multiwriter Development – Hyperdb\nBeaker Browser\nWebRTC\nIndexedDB\nRust\nC\nKeybase\nPGP\nWire\nZenodo\nDryad Data Sharing\nDataverse\nRSync\nFTP\nGlobus\nFritter\nFritter Demo\nRotonde how to\nJoe’s website on Dat\nDat Tutorial\nData Rescue – NYTimes Coverage\nData.gov\nLibraries+ Network\nUC Conservation Genomics Consortium\nFair Data principles\nhypervision\nhypervision in browser\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n Click here to read the unedited transcript…\n Tobias Macey 00:13\n Hello and welcome to the data engineering podcast the show about modern data management. 
When you’re ready to launch your next project, you’ll need somewhere to deploy it, you should check out Linotype data engineering podcast.com slash load and get a $20 credit to try out there fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to date engineering podcast com to subscribe to the show. Sign up for the newsletter read the show notes and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, tell your friends and co workers and share it on social media. I’ve got a couple of announcements before we start the show. There’s still time to register for the O’Reilly strata conference in San Jose, California how from March 5 to the eighth. Use the link data engineering podcast.com slash strata dash San Jose to register and save 20% off your tickets. The O’Reilly AI conference is also coming up happening April 29. To the 30th. In New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to data engineering podcast.com slash AI con dash new dash York to register and save 20% off the tickets. Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1 through the fourth. It has become one of the largest events for data scientists, data engineers and data driven businesses to get together and learn how to be more effective. To save 60% of your tickets go to data engineering podcast.com slash o d s c dash East dash 2018 and register. Your host is Tobias Macey. And today I’m interviewing Danielle Robinson and Joe hand about the DAP project the distributed data sharing protocol for building applications of the future. So Danielle, could you start by introducing yourself? Sure.\n Danielle Robinson 02:10\n My name is Danielle Robinson. And I’m the CO executive director of code for science and society, which is the nonprofit that supports that project. I’ve been working on debt related projects first as a partnerships director for about a year now. And I’m here with my colleague, Joe hand, take it away, Joe.\n Joe Hand 02:32\n Joe hand and I’m the other co executive director and the director of operations at code for science and society. And I’ve been a core contributor for about two years now.\n Tobias Macey 02:42\n And Danielle, starting with you again, can you talk about how you first got involved and interested in the area of data management? Sure.\n Danielle Robinson 02:48\n So I have a PhD in neuroscience. I finished that about a year and a half ago. And what I did during my PhD, my research was focused on cell biology Gee, really, without getting into the weeds too much on that a lot of time microscopes collecting some kind of medium sized aging data. And during that process, I became pretty frustrated with the academic and publishing systems that seemed to be limiting the access of access of people to the results of taxpayer funded research. So publications are behind paywalls. And data is either not published along with the paper or sometimes is published but not well archived and becomes inaccessible over time. So sort of compounding this traditionally, code has not really been thought of as an academic, a scholarly work. So that’s a whole nother conversation. 
But even though these things are changing, data and code aren’t shared consistently, and are pretty inconsistently managed within labs — I think that’s fair to say. What that does is make it really hard to reproduce or replicate other people’s research, which is important for the scientific process. So during my PhD, I got really active in the OpenCon and Mozilla Science communities, which I encourage your listeners to check out. These communities build interdisciplinary connections between the open source world and the open education, open access, and open data communities. That’s really important if you want to build things that people will actually use, and to make the big cultural and policy changes that will make it easier to access research and share data. So I got involved partly because of the technical challenge, but also because I’m interested in the people problems: the changes to the incentive structure and the culture of research that are needed to make data management better on a day-to-day basis and make our research infrastructure stronger and more long-lasting.
 Tobias Macey 04:54
 And Joe, how did you get involved in data management?
 Joe Hand 04:57
 Yeah, I’ve sort of gone back and forth between the more academic or research data management side and the more traditional software side. I really got started in data management when I was at a data visualization agency, where we built pretty, web-based interactive visualizations for a variety of clients. This was cool because it allowed me to see a large variety of data management techniques. There was the small scale — spreadsheets, manually updating data in spreadsheets, and then sending that off to visualize — all the way up to big Fortune 500 companies that had data warehouses and full internal APIs that we got access to. So it was really cool to see that variety of data collection and data usage across all those organizations. It also helped me understand how to use data effectively, and that really means telling a story around it: in order to use data, you have to use either some math or some visual representation, and the best stories around data combine a bit of both. From there, I moved to a research institute, where we were tasked with building a data platform for an international NGO. That group does census data collection in slums all over the world. As a research group, we were interested in using that data for research, but we also had to help them figure out how to collect it. Before we came in on that project, they’d basically been doing 30 years of data collection on paper, sometimes manually entering that data into spreadsheets, and then trying to share it around through thumb drives or Dropbox or whatever tools they had access to. This was cool because it gave me a great opportunity to see the other side of data management and analysis. We had worked with corporate clients, which have lots of resources, compute resources, and cloud servers, and this was the other side, where there are very few resources, most of the data analysis happens offline, and a lot of the data transfer happens offline.
So it was really interesting to see that a lot of the tools I’d been taking for granted couldn’t be applied in those areas. And then on the research side of things, I saw that scientists and governments were just as haphazardly organizing data in the same way. I was trying to collect and download census data from about 30 countries, and we had to email and fax people, and we got different CDs and paper documents and PDFs in other languages. So that really illustrated that there’s a lot of data management out there of a kind I wasn’t totally familiar with, and it’s striking how everybody manages their data in a different way. That’s along what I like to call the long tail of data management: people that don’t use traditional databases and manage data in their own unique ways. Most people managing data in that way probably wouldn’t call it data; it’s just what they use to get their job done. And so once I started to look at alternatives for managing that research data, I found Dat, basically, and was hooked and started to contribute. So that’s how I found Dat.
 Tobias Macey 08:16
 So that leads us nicely into talking about what the project is, and as much of the origin story as each of you might be aware of. Joe, you already mentioned how you got involved in the project, but Danielle, if you could also share your involvement or how you got started with it as well.
 Danielle Robinson 08:33
 Yeah, I can tell the origin story. The Dat project is an open source community building a protocol for peer-to-peer data sharing. As a protocol, it’s similar to HTTP in how the protocol is used today, but it adds extra security and automatic versioning, and it allows users to connect to a decentralized network. In a decentralized network, you can store the data anywhere — either in a cloud or on a local computer — and it does work offline. Dat is built to make it easy for developers to build decentralized applications without worrying about moving data around. The people who originally developed it — that would be Mathias, Max, and Karissa — were scratching their own itch, building software to share and archive public and research data. And this is how Joe got involved, like he was saying before. It originally started as an open source project, and then it got a grant from the Knight Foundation in 2013 — a prototype grant focusing on government data. That was followed up in 2014 by a grant from the Alfred P. Sloan Foundation, which focused more on scientific research and allowed the project to put a little more effort into working with researchers. Since then, we’ve been working to solve research data management problems by developing software on top of the Dat protocol. The most recent project is funded by the Gordon and Betty Moore Foundation. That project started in 2016; it’s called Dat in the Lab, and I can get you a link to it on our blog. It supports us to work with California Digital Library and research groups in the University of California system to make it easier to move files around, version datasets, and support researchers through automated archiving.
And so that’s a really cool project, because we get to work directly with researchers, do the kind of participatory software design that we enjoy doing, and create things that people will actually use. We also get to learn about really exciting research, very different from the research I did in my PhD — one of the labs we’re working with studies sea star wasting disease. It’s really fascinating stuff, and we get to work right with them to make things that are going to fit into their workflows. I started working with Dat in the summer right before that grant was funded — so maybe six months before. I came on as a consultant initially, to help write grants and start talking about how to work directly with researchers and what to build that would really help them move their data around and version control it. So yeah, that’s how I became involved. Then in the fall I transitioned to a partnerships position, and then to the executive director position in the last month.
 Tobias Macey 11:27
 And you mentioned that a lot of the boost to the project has come in the form of grants from a few different foundations. So I’m wondering if you can talk a bit about how those different grants have influenced the focus and pace of the development that was possible for the project.
 Joe Hand 11:42
 Yeah, Dat really occupies a unique position in the open source world with that grant funding. For the first few years, it was closer to a research project than a traditional product-focused startup. Other open source projects like this might be done part time as a side project, or just for fun, but the grant funding allowed the original developers to sign on and work full time, solving harder problems than they might be able to otherwise. Since we got those grants, we’ve been able to toe the line between a more user-facing product and research software. The grants gave us the opportunity to toe that line, but also to get in the field and connect with researchers and end users, so we can innovate with technical solutions but really ground those in reality with specific scientific use cases. This balance is really only possible because of that grant funding, which gives us more flexibility and maybe a little longer timeline than VC money or just an open source side project. But now we’re at a critical juncture, I’d say, where grant funding is not quite enough to cover what we want to do. We’re lucky that the protocol is getting to a more stable position, and we’re starting to look at user-facing products on top and starting to build those around the core protocol.
 Tobias Macey 13:10
 And the fact that you have received so many different rounds of grant funding lends credence to the fact that you’re solving a critical problem that lots of people are coming up against. I’m wondering if there are any other projects or companies or organizations that are trying to tackle similar or related problems that you view as collaborators or competitors in the space, and where you think the Dat project is uniquely positioned to solve the specific problems that it’s addressing.
 Joe Hand 13:44
 Yeah, I would say there are other similar use cases and tools.
A lot of that is around sharing open data sets and the publishing of data, which Danielle might be able to talk more about. On the technical side, I guess the biggest competitor, or most similar thing, might be IPFS, which is another decentralized protocol for sharing and storing data in different ways. But we’re actually excited to work with these various companies. IPFS is more of a storage-focused format — it basically allows content-addressed storage on a distributed network — whereas Dat is really more about the transfer protocol and being very interoperable with all these other solutions. So yeah, that’s what we’re more excited about: trying to understand how we can use Dat in collaboration with all these other groups.
 Danielle Robinson 14:41
 I’ll just plus one what Joe said. Through my time coming up in the OpenCon community and the Mozilla Science community, I’ve met a lot of people trying to improve access to data broadly, and most of the people — everyone in the space, really — take a collaboration, not competition, sort of approach, because there are a lot of different ways to solve the problem depending on what the end user wants. There are a lot of great projects working in the space. I would agree with Joe that IPFS is the thing people ask about most — I’ll be at an event and someone will say, what’s the difference between Dat and IPFS? — and I answer pretty much how Joe just answered. But it’s important to note that we know those people, we have good relationships with them, and we’ve actually just been emailing with them about some kind of collaboration over the next year. There are a lot of really great projects in the open data and improving-access-to-data space, and I basically support them all. There’s so much work to be done that I think there’s room for all the people in the space.
 Tobias Macey 15:58
 And now that you have established a nonprofit organization around Dat, are there any particular plans that you have to support future sustainability and growth for the project?
 Danielle Robinson 16:09
 Yes — future sustainability and growth for the project is what we wake up and think about every day, sometimes in the middle of the night. That’s the most important thing. Incorporating the nonprofit was a big step that happened, I think, at the end of 2016, and it’s critical as we move towards a self-sustaining future. Importantly, it will also allow us to continue to support and incubate other open source projects in the space, which is something I’m really excited about. For Dat, our goal is to support a core group of top contributors through grants, revenue sharing, and donations. Over the next 12 months we’ll be pursuing grants and corporate donations, as well as rolling out an Open Collective page to help facilitate smaller donations, and continuing to develop products with an eye towards things that can generate revenue and support the Dat ecosystem at the same time. We’re also focusing on sustainability within the project itself, and what I mean by that is governance and community management. So we are right now working with the developer community to formalize the technical process on the protocol through a working group.
Those are really great calls — lots of great people are involved. We really want to make sure that protocol decisions are made transparently, and that we can involve a wider group of the community in the process. We also want to make the path to participation, involvement, and community leadership clear for newcomers. By supporting the developer community, we hope to encourage new and exciting implementations of the Dat protocol. Some of the stuff that happened in 2017 — from my perspective, working in the sciences — sort of came out of nowhere: people are building amazing new social networks based on Dat, and it was really fun and exciting. So keeping the community healthy, and making sure that the technical process and how decisions get made is clear and transparent, is going to facilitate even more of that. And just another comment about being a nonprofit: because Code for Science & Society is a nonprofit, we also act as a fiscal sponsor. What that means is that like-minded projects that get grant funding but are not nonprofits — so they can’t accept the grant on their own — can run their grant through us, and we take a small percentage of that grant. We use that to help those projects by linking them up with our community; I work with them on grant writing, fundraising, and strategy; we support their community engagement efforts and sometimes offer technical support. We see this as really important to the ecosystem, and as a way to help smaller projects develop and succeed. Right now we do that with two projects: one of them is called Stencila, and I can send a link for that, and the other one is called Science Fair. Stencila is an open source, executable documents software project funded by the Alfred P. Sloan Foundation, and it’s looking to support researchers from data collection to document authoring. Science Fair is a peer-to-peer library built on Dat, which is designed to make it easy for scholars to curate collections of research on a certain topic, annotate them, and share them with their colleagues. That project was funded by a prototype grant from a publisher called eLife, and they’re looking for additional funding. So we’re working with both of them, and in the first quarter of this year, Joe and I are working to formalize the process of how we work with these other projects and what we can offer them, and hopefully we’ll be in a position to take on additional projects later this year. I really enjoy that work. I went through the Mozilla Fellowship, which was a 10-month-long, crazy period where Mozilla invested a lot in me: making sure I was meeting people, learning how to write grants, learning how to give good talks — all kinds of awesome investment. And for a person who goes through a program like that, or a person who has a side project, there’s a need for groups in the space who can incubate those projects and help them as they develop from the incubator stage to the middle stage, before they scale up. So as a fiscal sponsor, we’re hoping to be able to support projects in that space.
 Tobias Macey 20:32
 And digging into the Dat protocol itself: when I was looking through the documentation, it mentioned that the protocol itself is agnostic to the implementation, and I know that the current reference implementation is done in JavaScript.
So I’m wondering if you could describe a bit about how the protocol itself is designed, how the reference implementation is done, and how the overall protocol has evolved since it was first started — and what your approach is to versioning the protocol itself, to ensure that people who are implementing it in other technologies or formats are able to stay compliant with specific versions of the protocol as it evolves.
 Joe Hand 21:19
 Yeah, so Dat is basically a combination of ideas from Git, BitTorrent, and the web in general. There are a few key properties in Dat that basically any implementation has to recreate, and those are content integrity, decentralized mirroring of the data sets, network privacy, incremental versioning, and random access to the data. We have a white paper that explains all of these in depth, but I’ll explain how they work in a basic use case. Let’s say I want to send some data to Danielle, which I do all the time, and I have a spreadsheet where I keep track of my coffee intake. I want to live-sync it to Danielle’s computer so she can make sure I’m not overcaffeinating myself. Similar to how you get started with Git, I would put my spreadsheet in a folder and create a new dat. Whenever I create a new dat, it makes a new key pair: one public key and one private key. The public key is basically the dat link — kind of like a URL — so you can use it in anything that speaks the Dat protocol, and just open that up and look at all the files inside the dat. The private key allows me to write files to the dat, and it’s used to sign any of the new changes. That signature allows Danielle to verify that the changes actually came from me, and that somebody else wasn’t trying to fake my data, or to man-in-the-middle my data while I was transferring it to her. So I add my spreadsheet to the dat, and what Dat does is break that file into little chunks, hash all those chunks, and create a Merkle tree from them. That Merkle tree has lots of cool properties and is one of the key features of Dat. The Merkle tree allows us to sparsely replicate the data: if we had a really big data set and you only want one file, we can use the Merkle tree to download just that one file and still verify the integrity of that content, even with an incomplete data set. The other part that allows us to do that is the register. All the file contents are stored in one register, and all the metadata is stored in another register, and these registers are basically append-only ledgers. They’re also known as secure registers — Google has a project called Certificate Transparency that has similar ideas. Whenever a file changes, you append that change to the metadata register, and that register stores information about the structure of the file system, what version it is, and any other metadata, like the creation time or the change time of that file. And right now, as you said, Tobias, we’re very flexible on how things are implemented, but we basically store the files as files. That allows people to see the files normally and interact with them normally. But the cool part is that the on-disk file storage can be really flexible.
So as long as the implementation has random access, it can store the data in any different way. For example, we have a storage model built for servers that stores all of the files as a single file, which allows you to have fewer file descriptors open and gets the file I/O constrained to one file. Once my file gets added, I can share my link privately with Danielle — I can send it over chat or just paste it somewhere — and then she can clone my dat using our command line tool, the desktop tool, or the Beaker browser. When she clones my dat, our computers basically connect directly to each other. We use a variety of mechanisms to try to make that connection — that’s been one of the challenges, which I can talk about later: how to connect peer to peer and the difficulties around that. Once we do connect, we’ll transfer the data either over TCP or UDP; those are the default network protocols we use right now, but Dat can be implemented on basically any other transport. I think Mathias once said that if you could implement it over carrier pigeon, that would work fine, as long as you had a lot of pigeons. So we’re really open about how the protocol information gets transferred, and we’re working on a Dat-over-HTTP implementation too. That wouldn’t be peer to peer, but it would allow a traditional server fallback if no peers are online, or for services that don’t want to run peer to peer for whatever reason. Once Danielle clones my dat, she can open the spreadsheet just like a normal file and plug it into R or Python or whatever, and use her equation to measure my caffeine level. And then, let’s say I drink another cup of coffee and update my spreadsheet: the changes will automatically be synced to her as long as she’s still connected to me, and they’ll be synced throughout the network to anybody else that’s connected to me. The metadata register stores that updated file information, and the content register stores just the changed file blocks, so Danielle only has to sync the diff of that content change rather than the whole dataset again. This is really useful for big data sets. And yeah, we’ve had to design each of these pieces to be as modular as possible, both within our JavaScript implementation and in the protocol in general. So right now, developers can swap out the network protocols or the data storage. For example, if you want to use Dat in the browser, you can use WebRTC for the network and discovery and then use IndexedDB for data storage — IndexedDB has random access, so you can plug that directly into Dat, and we have some modules for those, and that should be working. We did have a WebRTC implementation we were supporting for a while, but we found it a bit inconsistent for our use cases, which are more around large file sharing; it still might be okay for chat and other more text-based things. So yeah, all of our implementations are in Node right now. I think that was both for usability and developer friendliness, and also for being able to work in the browser and across platforms. So we can distribute a binary of Dat pretty easily, and you can run Dat in the browser or build Dat tools on Electron.
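To make the sharing flow Joe describes concrete, here is a minimal sketch using the dat-node JavaScript library that the command line tool is built around. The API names are recalled from the dat-node documentation and the hex key is a hypothetical placeholder, so treat this as an illustrative assumption rather than a verified example:

    // Sharer: create (or resume) a dat in ./data and announce it to the network.
    var Dat = require('dat-node')

    Dat('./data', function (err, dat) {
      if (err) throw err
      dat.importFiles()   // watch ./data and add its files to the archive
      dat.joinNetwork()   // announce ourselves so peers can find us
      // The public key doubles as the share link, like a URL:
      console.log('dat://' + dat.key.toString('hex'))
    })

    // Receiver: clone that dat into ./copy using the shared key
    // ('<64-character-hex-key>' is a placeholder) and keep it live-synced.
    Dat('./copy', {key: '<64-character-hex-key>'}, function (err, dat) {
      if (err) throw err
      dat.joinNetwork()   // connect to peers and start downloading
    })

Because the registers are append-only, the receiver picks up later updates as incremental diffs rather than re-downloading the whole folder.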
So it allows a wide range of developer tools to be built on top of Dat. We also have a few community members now working on different implementations — Rust and C, I think, are the two that are going right now. As far as protocol versioning, that was actually one of the big conversations we were having in the last working group meeting, and it’s to be decided, basically. Through the stages we’ve gone through, we’ve broken the protocol quite a few times, and now we’re finally in a place where we want to make sure not to break it moving forward. There’s space in the protocol for information like version history, or the version of the protocol, so we’ll probably use that to signal the version and figure out how the tools implementing it can fall back to the latest version. Before all the file-based stuff, Dat went through a few different stages. It started as more of a versioned, decentralized database, and then, as Max and Mathias and Karissa moved to the scientific use cases, they removed more and more of the database architecture as it matured. That transition was really driven by user feedback and watching researchers work. We realized that so much of research data is still kept in files and moved manually between machines, so even if we were going to build a special database, a lot of researchers still wouldn’t be able to use it, because that requires more infrastructure than they have time to support. So we just kept working to build a general-purpose solution that allows other people to build tools to solve those more specific problems. The last point is that right now, all Dat transfer is basically one way: only one person can update the source. This is really useful for a lot of our research use cases, where data comes from lab equipment — there’s a specific source and you just want to disseminate that information to various computers — but it really doesn’t work for collaboration. That’s the next thing we’re working on, and we really want to make sure to solve this one-way problem before we move to the harder problem of collaborative data sets. This last major iteration is sort of the hardest, and that’s what we’re working on right now: it allows multiple users to write to the same dat, and with that we get into problems like conflict resolution, duplicate updates, and other harder distributed computing problems.
 Tobias Macey 30:24
 And that partially answers one of the next questions I had, which was to ask about conflict resolution. But if there’s only one source that’s allowed to update the information, then that solves a lot of the problems that might arise by syncing all these data sets between multiple machines, because there aren’t going to be multiple parties changing the data concurrently, so you don’t have to worry about how to handle those cases. And another question I had, from what you were talking about, is on the cryptography aspect of Dat: it sounds as though when you initialize the dat, it automatically generates the public and private keys, and so that private key is intrinsically linked with that particular data set.
But is there any way to use, for instance, Keybase or GPG to sign the source dat, in addition to the generated key, to establish your identity for when you’re trying to share that information publicly, and not necessarily via some channel that already has established trust?
 Joe Hand 31:27
 Yeah, I mean, you could do that within the dat itself, but we don’t really have any mechanism for doing it on top of Dat, so we’re sort of going to throw that into userland right now. But yeah, that’s a good question, and we’ve had some people, I think, experimenting with different identity systems and how to solve that problem. We’re pretty excited about the new Wire app, because it’s open source, it uses end-to-end encryption, and it has some identity system, and we’re trying to see if we can build that on top of Wire. So that’s one of the things we’re experimenting with.
 Tobias Macey 32:09
 And one of the primary use cases that is mentioned in the documentation and the website for Dat is being able to host and distribute open data sets, with a focus on researchers and academic use cases. So I’m wondering if you can talk some more about how Dat helps with that particular effort, and what improvements it offers over some of the existing solutions that researchers were using before.
 Danielle Robinson 32:33
 There are solutions for both hosting and distributing data. In terms of hosting and distribution, there’s a lot of great work focused on data publication and making sure that data associated with publications is available online — I’m thinking of Zenodo and Dryad, or Dataverse. There are also other data hosting platforms such as CKAN or data.world. We really love the work these people do, and we’ve collaborated with some of them or are involved with them in friendly organizations — for example, some people from Dryad are involved in the open source alliance for open scholarship that eLife is part of — so it’s nice to work with them, and we’d love to work with them to use Dat to upload and distribute data. But right now, if researchers need to share files between many machines and keep them updated and versioned — for example, if there’s a large, live-updating data set — there really aren’t great solutions that address data versioning and sharing. In terms of sharing and transferring, lots of researchers still manually copy files between machines and servers, or use tools like rsync or FTP, which is how I handled it during my PhD. Other software, such as Globus or even Dropbox, can require more IT infrastructure than a small research group may have. Researchers are all operating on limited grant funding, and they also depend on the IT infrastructure of their institution to get them access to certain things. So a researcher like me might spend all day collecting a terabyte of data on a microscope and then wait for hours, or overnight, to move it to another location. The ideal situation from a data management perspective is that the raw data are automatically archived to a web server and sent to the researcher’s computer for processing, so you have an archived copy of the raw data that came off of the equipment. And in the process, files also need to be archived — so you need archives of the imaging files, in this case, at each step in processing.
And then when a publication is ready, for the data processing pipeline to be fully reproducible, you’ll need the code and you’ll need the data at different stages, so that even without access to the compute — the cluster where the analysis was done — a person should be able to repeat it. And I say ideally, because this isn’t really how it’s happening now. Some of the things that stop data from being archived at the different steps are just the cost of storage, the availability of storage, and researcher habits. I definitely know some researchers who kept data on hard drives in Tupperware, to protect them in case the sprinklers ever went off — which isn’t really a long-term solution. True facts. So Dat can automate these archiving steps at different checkpoints and make the backups easier for researchers. As a former researcher, I’m interested in anything that makes better data management automatic for researchers. We’re also interested in versioned compute environments, to help labs avoid the “drawer full of Jaz drives” problem — which is, sadly, a quote from a senior scientist who was describing a bunch of data collected by her lab that she can no longer access. She has the drawer, she has the Jaz drives, she can’t get into them; that data is essentially lost. Researchers are really motivated to make sure that when things are archived, they’re archived in a form where they can actually be accessed, but because researchers are so busy, it’s really hard to know when that is. So because we’re focused on essentially filling in the gaps between the services that researchers already use, and automating things, I think Dat is in a really good position to solve some of these problems. Of the researchers we’re working with now, I’m thinking of one person who has a large data set and a bioinformatics pipeline. He’s at a UC lab, and he wants to get all the information to his collaborator right here in Washington State, and it’s taken months and he has not been able to do it — he just can’t move that data across institutional lines. That’s a much longer conversation as to why exactly that isn’t working, but we’re working with him to try to make it possible for him to move the data, and to create a versioned emulation of his compute environment, so that his collaborator can just do what he was doing and not need to spend four months worrying about dependencies and stuff. So yeah, hopefully that answers the question.
 Tobias Macey 37:39
 And one of the other difficult aspects of building a peer-to-peer protocol is the fact that in order for there to be sufficient value in the protocol itself, there needs to be a network of people behind it to share that information with, and to share the bandwidth requirements for being able to distribute it. So I’m wondering how you have approached the effort of building up that network, and how much progress you feel you have made in that effort.
 Joe Hand 38:08
 Yeah, I’m not sure we really view Dat as a traditional peer-to-peer protocol in that sense, relying on network effects to scale. As Danielle said, we’re just trying to get data from A to B, and so our critical mass is basically two users on a given data set.
So obviously, we want to first build something that offers better tools for those two users than the traditional cloud or client-server model. If I’m transferring files to another researcher using Dropbox, we have to transfer files via a third party — a third computer — before they can get to the other computer. So rather than going direct between two computers, we have to take a detour, and this has implications for speed, but also for security, bandwidth usage, and even something like energy usage. By cutting out that third computer, we feel like we’re already adding value to the network. We’re hoping that researchers doing these manual transfers can see the value of going directly, and of using something that is versioned and can be live-synced, over existing tools like rsync or FTP, or the commercial services that might store data in the cloud. And you know, we really don’t have anything against the centralized services — we recognize that they’re very useful sometimes, but they also aren’t the answer to everything. Depending on the use case, a decentralized system might make more sense than a centralized one, and we want to offer developers and users the option to make that choice, which we don’t really have right now. But in order to do that, we have to start with peer-to-peer tools first. Once we have that decentralized network, we can limit the network to one server peer and many clients, and then all of a sudden it’s centralized. So we understand that it’s easy to go from decentralized to centralized, but it’s harder to go the other way around — we have to start with a peer-to-peer network in order to solve all these different problems. The other thing is that we know file systems are not going away. We know that web browsers will continue to support static files, and we also know that people will want to move these things between computers, back them up, archive them, and share them to different computers. So we know files are going to be transferred a lot in the future, and that’s something we can depend on. People will probably even want to do this in a secure way sometimes, and maybe in an offline environment or on a local network. So we’re basically building from those basic principles, using peer-to-peer transfer as the bedrock of all that. That’s how we got to where we are now with the peer-to-peer network. But we’re not really worried that we need a certain number, or critical mass, of users to add value, because we feel that by building the right tools with these principles, we can start adding value whether it’s a decentralized network or a centralized one.
 Tobias Macey 40:59
 And one of the other use cases that’s been built on top of Dat is being able to build websites and applications that can be viewed by web browsers and distributed peer to peer in that manner. So I’m wondering how much uptake you’ve seen in usage for that particular application of the protocol,
and how much development effort is being focused on that particular use case.
 Joe Hand 41:20
 Yeah, so if I open my Beaker browser right now — which is the main web implementation we have, that Paul Frazee and Tara Vancil are working on — I usually have 50 to 100, or sometimes 200, peers that I connect to right away. That’s through some of the social network apps, like Rotonde or Fritter, and then just some personal sites. We’ve been working with the Beaker browser folks for probably two years now, co-developing the protocol and seeing what they need support for in Beaker. But it comes back to that basic principle: we recognize that a lot of websites are static files, and if we can just support static files in the best way possible, then you can browse a lot of websites. That even gives you the benefit that things that are more interactive have to be developed so they work offline, too. So both Rotonde and Fritter can work offline, and then once you get back online, you can just sync the data seamlessly. That’s the most exciting part about those.
 Danielle Robinson 42:29
 You mean Fritter, not freighter. Fritter is the Twitter clone that Tara Vancil and Paul made. Beaker’s a lot of fun, and if you’ve never played around with it, I would encourage you to download it — I think it’s just beakerbrowser.com. I’m not a developer by trade, but I have seriously enjoyed playing around on Beaker, and I think some of the more frivolous things like Fritter that have come out of it are a lot of fun, and really speak to the potential of peer-to-peer networks in today’s era, as people are becoming increasingly frustrated with the centralized platforms.
 Tobias Macey 43:13
 And given that the content being distributed via Dat using the browser is primarily static in nature, I’m wondering how that affects the architectural patterns that people are used to with the common three-tier architecture. You’ve already mentioned a couple of social network applications that have been built on top of it, but I’m wondering if there are any others that are built on top of and delivered via Dat that you’re aware of and could talk about, that speak to some of the ways people are taking advantage of Dat in more of the consumer space.
 Joe Hand 43:47
 Yeah, I think one of the big shifts that has made this easier is having databases in the browser — things like IndexedDB or other local storage databases — and then being able to sync those to other computers. So as long as you know that I’m only writing to my own database, it works. I think people are trying to build games off this: you could build a chess game where I write to my local database, you have some logic for determining whether a move is valid or not, and then you sync that to your competitor. It’s a more constrained environment, but I think that also gives you the benefit of being able to constrain your development, and of not requiring these external services or external database calls. I know that I’ve tried a few times to develop projects — just fun little things —
and it is a challenge, because you have to think differently about how those things work. You can’t necessarily rely on external services, whether that’s something as simple as loading fonts from an external service, or CSS styles, or external JavaScript — you want all of that to be packaged within one dat if you want to ensure it’s all going to work. So you do think about it a little differently, even for those simple things. And yeah, it does constrain the bigger applications. I think the other area where we could see development is Electron applications — so maybe not in Beaker, but using Electron as a platform for other types of applications that might need those more flexible models. Science Fair, which is one of our hosted projects, is a really good example of how to use Dat to distribute data while still having a full application. You can distribute all the data for the application over Dat and keep it updated through the live syncing, and users can download just the PDFs they need to read, or the journals or the figures they want. So it allows developers to have that flexible model — you can distribute things peer to peer and have both the live syncing and the ability to download whatever data users need — and it provides that framework for data management.
 Tobias Macey 46:15
 And one of the other challenges that’s posed, particularly for this public distribution use case, is content discovery, because by default the URLs that are generated are private and unguessable — they’re essentially just hashes of the content. So I’m wondering if there are any particular mechanisms that you either have built, or planned, or started discussing, for facilitating content discovery of the information that’s being distributed by these different networks.
 Joe Hand 46:50
 Yeah, this is definitely an open question. I’ll fall back on my common answer, which is that it depends on the tool and the different communities, and there are going to be different approaches — some might be more decentralized, and some might be centralized. For example, with data set discovery, there are a lot of good centralized services for data set publishing, as Danielle mentioned, like Zenodo or Dataverse. These are places that already have discovery engines, I guess we’ll say, and they publish data sets. So you could similarly publish the dat URL along with those data sets, so that people have an alternative way to download them. That’s one way we’ve been thinking about discovery: leveraging existing solutions that are doing a really good job in their domain, and trying to work with them to start using Dat for their data management. Another, sort of hacky, solution is using existing domains and DNS. Basically, you can publish a regular HTTP site on your URL and give it a specific well-known file that points to your dat address, and then the Beaker browser can find that file and tell you that a peer-to-peer version of that site is available. So we’re basically leveraging the existing DNS infrastructure to start to discover content just with existing URLs.
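As a concrete sketch of that well-known-file lookup: as I recall Beaker’s convention, a site serves a plain-text file at /.well-known/dat whose first line is its dat link, so a client can resolve a hostname with a few lines of JavaScript. The path, the file format, and the helper name below are assumptions from memory, not a verified specification:

    // Resolve a hostname to a dat link via its well-known file.
    // Assumes the /.well-known/dat convention: first line "dat://<key>",
    // optionally followed by a TTL line. Runs on Node 18+ (global fetch).
    async function resolveDatLink (hostname) {
      const res = await fetch('https://' + hostname + '/.well-known/dat')
      if (!res.ok) throw new Error('no well-known dat file on ' + hostname)
      const firstLine = (await res.text()).split('\n')[0].trim()
      if (!firstLine.startsWith('dat://')) {
        throw new Error('malformed well-known dat file')
      }
      return firstLine // e.g. dat://<64-character-hex-key>
    }

    // resolveDatLink('example.com').then(console.log)

The appeal of this design is that it reuses DNS and HTTPS for trust and naming while the actual content transfer stays peer to peer.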
And I think a lot of the discovery will be more community-based. For example, in Fritter and Rotonde, people are starting to build crawlers or search bots to discover users or to search. So it’s basically about looking at where there is need, identifying different types of crawlers to build, and figuring out how to connect those communities in different ways. We’re really excited to see what ideas pop up in that area, and they’ll probably come in a decentralized way, we hope.
 Tobias Macey 48:46
 And for somebody who wants to start using Dat, what is involved in creating and/or consuming the content that’s available on the network, and are there any particular resources available to get somebody up to speed on how it works and some of the different uses they could put it to?
 Danielle Robinson 49:05
 Sure, I can take that — and Joe, just chime in if you think of anything else. We built a tutorial for our work with the labs and for MozFest this year; that’s at try-dat.com. The tutorial takes you through how to work with the command line tool and some basics about Beaker. And please tell us if you find a bug — there may be bugs, fair warning — but it was working pretty well when I used it last, and it’s in the browser. It spins up a little virtual machine, so you can either share data with yourself or do it with a friend and share data with your friend. Beaker is also super easy for a user who wants to get started: you can visit pages over dat just like you would a normal web page. For example, you can go to a website — we’ll give Tobias the link for that — and just change the HTTP to dat, so it looks like dat://jhand.space. Beaker also has this fun thing that lets you create a new site with a single click, and you can fork sites and edit them and make your own copies of things, which is fun if you’re learning about how to build static websites. So you can go to beakerbrowser.com and learn about that. I think we’ve already talked about Rotonde and Fritter, and we’ll add links for people who want to learn more about those. And then for data-focused users, you can use Dat for sharing or transferring files, either with the desktop application or the command line interface. So if you’re interested, we encourage you to play around — the community is really friendly and helpful to new people. Joe and I are always on the IRC channel or on Twitter, so if you have questions, feel free to ask. We love talking to new people, because that’s how all the exciting stuff happens in this community.
 Tobias Macey 50:58
 And what have been some of the most challenging aspects of building the project and the community, and promoting the use cases and capabilities of the project?
 Danielle Robinson 51:10
 I can speak a little bit to promoting it in academic research. In academic research — probably similar to many of the industries where your listeners work — software decisions are not always made for entirely rational reasons. There’s tension between what your boss wants, what the IT department has approved (meaning institutional data security needs), and the perceived time cost of developing a new workflow and getting used to a new protocol. So we try to work directly with researchers to make sure the things we build are easy and secure, but it is a lot of promotion and outreach to get scientists to try a new workflow.
They’re really busy, and the incentives are all: get more grants, do more projects, publish more papers. So even if something will eventually make your life easier, it’s hard to sink time in up front. One thing I noticed — and this is probably common to all industries — is that I’ll be talking to someone and they’ll say, oh, archiving the data from my research group is not a problem for me, and then they’ll proceed to describe a super problematic data management workflow. It’s not a problem for them anymore because they’re used to it, so it doesn’t hurt day to day. But doing things like waiting until the point of publication and then trying to go back and archive all the raw data — maybe some was collected by a postdoc who’s now gone, some by a summer student who used a non-standard naming scheme for all the files — there are just a million ways that stuff can go wrong. So for now, we’re focusing on developing real-world use cases and participating in community education around data management. We want to build stuff that’s meaningful for researchers and others who work with data, and we think that working with people and doing the nonprofit thing, with grants, is going to be the way to get us there. Joe, do you want to talk a little bit about building?
 Joe Hand 53:03
 Yeah, sure. In terms of building it, I haven’t done too much work on the core protocol, so I can’t say much about the difficult design decisions there. I’m the main developer on the command line tool, and most of the challenging decisions there are about user interfaces, not necessarily technical problems — so, as Danielle said, it’s as much about people as it is about software. But one of the most challenging things we’ve run into a lot is basically network issues. In a peer-to-peer network, you have to figure out how to connect peers directly on networks where they might not be supposed to do that. I think a lot of that comes from BitTorrent, which has led different institutions to restrict peer-to-peer networking in different ways. So we’re having to fight that battle against these existing restrictions, trying to find out how these networks are restricted and how we can continue to have success in connecting peers directly rather than through a third-party server. And it’s funny — or maybe not funny — but some of the strictest networks we’ve found are actually at academic institutions. For example, at one of the UC campuses, we found out that computers can never connect directly to other computers on that same network. So if we wanted to transfer data between two computers sitting right next to each other, we’d basically have to go through an external cloud server just to get it to the machine next door — or use a hard drive or a thumb drive or whatever. All these different network configurations are, I think, one of the hardest parts, both in terms of implementation and in terms of testing, since we can’t readily get into these UC campuses to see what the network setup is.
So we’re trying to create more tools around networks — both testing networks in the wild and using virtual networks to test different types of network setups — and to leverage those two things combined to try to get around all these network connection issues. So yeah, I would love to pass Mathias this question about the design decisions in the core protocol, but I can’t really say much about that, unfortunately.
 Tobias Macey 55:29
 And are there any particularly interesting or inspiring uses of Dat that you’re aware of that you’d like to share?
 Danielle Robinson 55:36
 Sure, I can share a couple of things that we were involved in. Last January — January 2017 — we were involved in the Data Rescue and Libraries+ Network community, which was the movement to archive government-funded research at trusted public institutions like libraries and archives. As a part of that, we got to work with some of the really awesome people at California Digital Library. California Digital Library is really cool because it is a digital library with a mandate to preserve, archive, and steward the data produced in the UC system — it supports the entire UC system, and the people are great. So we worked with them to make the first-ever backup of data.gov in January of 2017, and I think my colleague had 40 terabytes of metadata sitting in his living room for a while as we were working up to the transfer. That was a really cool project, and it produced a useful thing. We got to work with some of the data.gov people to make that happen, and they were like — really, it has never been backed up? So it was a good time to do it. But believe it or not, it’s actually pretty hard to find funding for that work, and we have more work we’d like to do in that space. Archiving copies of federally funded research at trusted institutions is a really critical step towards ensuring the long-term preservation of the research that gets done in this country, so hopefully 2018 will see those projects funded, or new collaborations in that space. It’s also a fantastic community — a lot of really interesting librarians and archivists who have great perspective on long-term data preservation — and I love working with them, so hopefully we can do something else there. The other thing that I’m really excited about is working on the Dat in the Lab project, on the Dat container issue. And I don’t mind going a little over time, so I don’t know how much I should go into this, but we’ve learned a lot about really interesting research. We’re working to develop a container-based simulation of a research computing cluster that can run on any machine or in the cloud. By creating a container that includes the complete software environment of the cluster, researchers across the UC system can quickly get the analysis pipelines they’re working on usable in other locations. And this, believe it or not, is a big problem. I was sort of surprised when one researcher told me she had been working for four months to get a pipeline running at UC Merced that had been developed at UCLA — you could drive back and forth between Merced and UCLA a bunch of times in four months. But it’s this little stuff that really slows research down, so I’m really excited about the potential there.
And we’ve written a couple of blog posts on that, so I can add the links to those in the follow-up.
 Joe Hand 58:36
 And I’d say the most novel use that I’m excited about is called hypervision, which is basically video streaming built on Dat. Mathias Buus, one of the lead developers on Dat, is prototyping something similar with Danish public TV, where they basically want to live stream their channels over the peer-to-peer network. I’m excited about that because I’d really love to get more public television and public radio distributing content peer to peer, so we can reduce their infrastructure costs and hopefully allow more of that great content to come out.
 Tobias Macey 59:09
 Are there any other topics that we didn’t discuss yet that you think we should talk about before we close out the show?
 Danielle Robinson 59:15
 Um, I think I’m feeling pretty good. What about you, Joe?
 Joe Hand 59:18
 Yeah, I think that’s it for me. Okay.
 Tobias Macey 59:20
 So for anybody who wants to keep up to date with the work you’re doing or get in touch, I’ll have you each add your preferred contact information to the show notes. And as a final question, to give the listeners something else to think about: from your perspective, what is the biggest gap in the tooling or technology that’s available for data management today?
 Joe Hand 59:42
 I’d say transferring files — which feels really funny to say, but to me it’s still a problem that’s not really well solved. Just: how do you get files from A to B in a consistent and easy-to-use manner? Especially if you want a solution that doesn’t require a command line, is still secure, and hopefully doesn’t go through a third-party service — because hopefully that means it works offline. A lot of what I saw in the developing world is the need for data management that works offline, and I think that’s one of the biggest gaps that we don’t really address yet. There are a lot of great data management tools out there, but I think they’re aimed more at data scientists or software-focused users that might use managed databases or something like Hadoop. There’s really a ton of users out there that don’t have tools. Most of the world is still offline, or has inconsistent internet, and putting everything through servers in the cloud isn’t really feasible. But the alternatives now require careful, manual data management if you don’t want to lose all your data. So we really hope to find a good balance between those two needs and those two use cases.
 Danielle Robinson 01:00:48
 Plus one to what Joe said: transferring files. It does feel funny to say, but it is still a problem in a lot of industries, and especially where I come from in research science. And from my perspective, I guess the other issue is that the people problems are always as hard as or harder than the technical problems. If people don’t think it’s important to share data or archive data in an accessible and usable form, we could have the world’s best, easiest-to-use tool, and it wouldn’t impact the landscape or the accessibility of data.
And similarly, if people are sharing data that’s not usable, because it’s missing experimental context, or it’s in a proprietary format, or because it’s shared under a restrictive license, it’s also not going to impact the landscape, or be useful to the scientific community or the public. So we want to build great tools, but I also want to work to change the incentive structure in research to ensure that good data management practices are rewarded, and so that data is shared in a usable form. That’s really key. And I’ll add a link in the show notes to the FAIR data principles, which say data should be findable, accessible, interoperable, and reusable, something that your listeners might want to check out if they’re not familiar with it. It’s a framework developed in academia, but I’m not sure actually how much impact it’s had outside of that sphere. But it would be interesting to talk to your listeners a little bit about that. And yeah, I’ll put my contact info in the show notes, and I’d love to connect with anyone and/or answer any further questions about that, and what we’re going to try to do with Code for Science and Society over the next year. So thanks a lot, Tobias, for inviting us.\n Tobias Macey 01:02:30\n Yeah, absolutely. Thank you both for taking the time out of your days to join me and talk about the work you’re doing. It’s definitely a very interesting project with a lot of useful potential, and so I’m excited to see where you go from now into the future. So thank you both for your time and I hope you enjoy the rest of your evening.\n Unknown Speaker 01:02:48\n Thank you. Thank you.\n Transcribed by https://otter.ai\n \n\n\n","content_html":"Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello and welcome to the data engineering podcast the show about modern data management. When you’re ready to launch your next project, you’ll need somewhere to deploy it, you should check out Linotype data engineering podcast.com slash load and get a $20 credit to try out there fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to date engineering podcast com to subscribe to the show. Sign up for the newsletter read the show notes and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, tell your friends and co workers and share it on social media. I’ve got a couple of announcements before we start the show. There’s still time to register for the O’Reilly strata conference in San Jose, California how from March 5 to the eighth. Use the link data engineering podcast.com slash strata dash San Jose to register and save 20% off your tickets. The O’Reilly AI conference is also coming up happening April 29. To the 30th. In New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to data engineering podcast.com slash AI con dash new dash York to register and save 20% off the tickets. Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1 through the fourth. It has become one of the largest events for data scientists, data engineers and data driven businesses to get together and learn how to be more effective. To save 60% of your tickets go to data engineering podcast.com slash o d s c dash East dash 2018 and register. Your host is Tobias Macey. And today I’m interviewing Danielle Robinson and Joe hand about the DAP project the distributed data sharing protocol for building applications of the future. So Danielle, could you start by introducing yourself? Sure.
\nMy name is Danielle Robinson, and I’m the co-executive director of Code for Science and Society, which is the nonprofit that supports the Dat project. I’ve been working on Dat-related projects, first as partnerships director, for about a year now. And I’m here with my colleague, Joe Hand. Take it away, Joe.
\nI’m Joe Hand, and I’m the other co-executive director and the director of operations at Code for Science and Society. And I’ve been a core contributor to Dat for about two years now.
\nAnd Danielle, starting with you again, can you talk about how you first got involved and interested in the area of data management? Sure.
\nSo I have a PhD in neuroscience. I finished that about a year and a half ago, and my research during my PhD was focused on cell biology. Really, without getting into the weeds too much on that, I spent a lot of time on microscopes collecting medium sized imaging data. And during that process, I became pretty frustrated with the academic and publishing systems that seemed to be limiting the access of people to the results of taxpayer funded research. So publications are behind paywalls, and data is either not published along with the paper, or sometimes is published but not well archived and becomes inaccessible over time. Sort of compounding this, traditionally code has not really been thought of as an academic, scholarly work. So that’s a whole other conversation. But even though these things are changing, data and code aren’t shared consistently, and are pretty inconsistently managed within labs. I think that’s fair to say. And what that does is it makes it really hard to reproduce or replicate other people’s research, which is important for the scientific process. So during my PhD, I got really active in the OpenCon and Mozilla Science communities, which I encourage your listeners to check out. These communities build interdisciplinary connections between the open source world and the open education, open access, and open data communities. And that’s really important to build things that people will actually use, and to make big cultural and policy changes that will make it easier to access research and share data. So I got involved partly because of the technical challenge, but also because I’m interested in the people problems. So the changes to the incentive structure and the culture of research that are needed to make data management better on a day to day basis and make our research infrastructure stronger and more long lasting.
\nAnd Joe, how did you get involved in data management?
\nYeah, I’ve sort of gone back and forth between the more academic or research data management side and the more traditional software side. So I really got started in data management when I was at a data visualization agency, and we basically built pretty, web based, interactive visualizations for a variety of clients. This was cool, because it sort of allowed me to see a large variety of data management techniques. So there was the small scale, manually updating data in spreadsheets and then sending that off to visualize, up to big Fortune 500 companies that had data warehouses and full internal APIs that we got access to. So it was really cool to see that variety of data collection and data usage between all those organizations. That was also good because it helped me understand how to use data effectively, and that really means telling a story around it. So in order to use data, you have to either use some math or some visual representation, and the best stories around data combine a bit of both of those. And then from there, I moved to a research institute, and we were tasked with building a data platform for an international NGO. That group basically does census data collection in slums all over the world. And so as a research group, we were interested in using that data for research, but we also had to help them figure out how to collect that data. Before we came in on that project, they’d basically been doing 30 years of data collection on paper, and then sometimes manually entering that data into spreadsheets, and then trying to share that around through thumb drives or Dropbox or whatever tools they had access to. So this was cool, because it really gave me a great opportunity to see the other side of data management and analysis. We had worked with the corporate clients, which have lots of resources and computing resources and cloud servers, and this was the other side, where there are very few resources, most of the data analysis happens offline, and a lot of the data transfer happens offline. So it was really interesting to see that a lot of the tools I’d been taking for granted couldn’t be applied in those areas. And then on the research side of things, I saw that scientists and governments were just sort of haphazardly organizing data in the same way. I was trying to collect and download census data from about 30 countries, and we had to email and fax people, and we got different CDs and paper documents and PDFs in other languages. So that really illustrated that there’s a lot of data management out there in a way that I wasn’t totally familiar with, and it’s just very crazy how everybody manages their data in a different way. And that’s sort of along what I like to call the long tail of data management. So people that don’t use traditional databases, or manage data in their own unique ways. And most people managing data in that way probably wouldn’t call it data; it’s just sort of what they use to get their job done. And so once I started to look at alternatives for managing that research data, I found Dat, basically, and was hooked and started to contribute. So that’s sort of how I found Dat.
\nSo that leads us nicely into talking about what the project is, and as much of the origin story as each of you might be aware of. And Joe, you already mentioned how you got involved in the project. But Danielle, if you could also share your involvement or how you got started with it as well,
\nyeah, I can tell the origin story. So the Dat project is an open source community building a protocol for peer to peer data sharing. As a protocol, it’s similar to HTTP in how the protocol is used today, but Dat adds extra security and automatic versioning, and allows users to connect to a decentralized network. In a decentralized network, you can store the data anywhere, either in a cloud or on a local computer, and it does work offline. And so Dat is built to make it easy for developers to build decentralized applications without worrying about moving data around. The people who originally developed it, that would be Mathias and Max and Karissa, were scratching their own itch, building software to share and archive public and research data. And this is how Joe got involved, like he was saying before. So it originally started as an open source project, and then it got a grant from the Knight Foundation in 2013, as a prototype grant focusing on government data. And then that was followed up in 2014 by a grant from the Alfred P. Sloan Foundation, and that grant focused more on scientific research and allowed the project to put a little more effort into working with researchers. And since then, we’ve been working to solve research data management problems by developing software on top of the Dat protocol. The most recent project is funded by the Gordon and Betty Moore Foundation. That project started in 2016, it’s called Dat in the Lab, and I can give you a link to it on our blog. It supports us to work with California Digital Library and research groups in the University of California system to make it easier to move files around, version data sets, and support researchers through automated archiving. And so that’s a really cool project, because we get to work directly with researchers and do the kind of participatory design software work that we enjoy doing, and create things that people will actually use. And we get to learn about really exciting research, very different from the research I did in my PhD; one of the labs we’re working with studies sea star wasting disease. So it’s really fascinating stuff, and we get to work right with them to make things that are going to fit into their workflows. So I started working with Dat in the summer, right before that grant was funded, so I guess maybe six months before that grant was funded. I came on as a consultant initially, to help write grants and start talking about how to work directly with researchers and what to build that will really help researchers move their data around and version control it. So yeah, that’s how I became involved. And then in the fall, I transitioned to a partnerships position, and then the ED position in the last month.
\nAnd you mentioned that a lot of the sort of boost to the project has come in the form of grants from a few different foundations. So I’m wondering if you can talk a bit about how those different grants have influenced the focus and pace of the development that was possible for the project?
\nYeah, I mean, Dat really occupies a unique position in the open source world with that grant funding. So, you know, for the first few years, it was closer to a research project than a traditional product focused startup. Other open source projects like that might be done part time as a side project, or just sort of for fun, but the grant funding really allowed the original developers to sign on and work full time, solving harder problems than they might be able to otherwise. So since we got those grants, we’ve been able to toe the line between a more user facing product and research software. And the grants really gave us the opportunity to toe that line, but also to get in the field and connect with researchers and end users. So we can innovate with technical solutions, but really ground those in reality with specific scientific use cases. And this balance is really only possible because of that grant funding, which gives us more flexibility and maybe a little longer timeline than VC money or just an open source side project. But now we’re really at a critical juncture, I’d say, where grant funding is not quite enough to cover what we want to do. But we’re lucky the protocol is really getting to a more stable position, and we’re starting to look at those user facing products on top and starting to build those around the core protocol.
\nAnd the fact that you have received so many different rounds of grant funding sort of lends credence to the fact that you’re solving a critical problem that lots of people are coming up against. And I’m wondering if there are any other projects or companies or organizations that are trying to tackle similar or related problems that you sort of view as collaborators or competitors in the space, and where you think the Dat project is fairly uniquely positioned to solve the specific problems that it’s addressing?
\nYeah, I mean, I would say there are other similar use cases and tools, and a lot of that is around sharing open data sets, and sort of the publishing of data, which Danielle might be able to talk more about. But on the technical side, I guess the biggest competitor or similar thing might be IPFS, which is another decentralized protocol for sharing and storing data in different ways. But we’re actually excited to work with these various companies. So, you know, IPFS is a more storage focused format; it basically allows content addressed storage on a distributed network. And Dat is really more about the transfer protocol, and being very interoperable with all these other solutions. So yeah, that’s what we’re more excited about: trying to understand how we can use Dat in collaboration with all these other groups. Yeah,
\nI think I’ll just plus one what Joe said. Through my time coming up in the OpenCon community and the Mozilla Science community, there are a lot of people trying to improve access to data broadly. And most of the people I know, everyone in the space really, takes a collaboration, not competition, sort of approach, because there are a lot of different ways to solve the problem, depending on what the end user wants. And there are a lot of great projects working in the space. I would agree with Joe, I guess, that IPFS is the thing that people sometimes ask about; like, I’ll be at an event and someone will say, what’s the difference between Dat and IPFS, and I answer pretty much how Joe just answered. But it’s important to note that we know those people, and we have good relationships with them, and we’ve actually just been emailing with them about some kind of collaboration over the next year. So there are a lot of really great projects in the open data and improving access to data space, and I basically support them all. There’s so much work to be done that I think there’s room for all the people in the space.
\nAnd now that you have established a nonprofit organization around Dat, are there any particular plans that you have to support future sustainability and growth for the project?
\nYes, future sustainability and growth for the project is what we wake up and think about every day, sometimes in the middle of the night. That’s the most important thing. And incorporating the nonprofit was a big step that happened, I think, at the end of 2016, and it’s critical as we move towards a self sustaining future. And importantly, it will also allow us to continue to support and incubate other open source projects in the space, which is something that I’m really excited about. For Dat, our goal is to support a core group of top contributors through grants, revenue sharing, and donations. And so over the next 12 months we’ll be pursuing grants and corporate donations, as well as rolling out an Open Collective page to help facilitate smaller donations, and continuing to develop products with an eye towards things that can generate revenue and support the Dat ecosystem at the same time. We’re also focusing on sustainability within the project itself, and what I mean by that is, you know, governance and community management. So we are right now working with the developer community to formalize the technical process on the protocol through a working group. Those are really great calls, and lots of great people are involved in them. We really want to make sure that protocol decisions are made transparently, and that we can involve a wider group of the community in the process. We also want to make the path to participation, involvement, and community leadership clear for newcomers. So by supporting the developer community, we hope to encourage new and exciting implementations of the Dat protocol. Some of the stuff that happened in 2017, from my perspective working in the sciences, sort of came out of nowhere, and people are building, you know, amazing new social networks based on Dat. And it was really fun and exciting. And so just keeping the community healthy, and making sure that the technical process and how decisions get made is really clear and transparent, I think is going to facilitate even more of that. And just another comment about being a nonprofit: because Code for Science and Society is a nonprofit, we also act as a fiscal sponsor. And what that means is that like minded projects who get grant funding but are not nonprofits, so they can’t accept the grant on their own, can accept their grant through us. And then we take a small percentage of that grant, and we use that to help those projects by linking them up with our community. I work with them on grant writing and fundraising and strategy, we support their own community engagement efforts, and sometimes we offer technical support. And we see this as really important to the ecosystem and a way to help smaller projects develop and succeed. So right now we do that with two projects. One of them is called Stencila, and I can send a link for that, and the other one is called Science Fair. Stencila is an open source reproducible documents software project funded by the Alfred P. Sloan Foundation, and it’s looking to support researchers from data collection to document authoring. And Science Fair is a peer to peer library built on Dat, which is designed to make it easy for scholars to curate collections of research on a certain topic, annotate them, and share them with their colleagues. And so that project was funded by a prototype grant from a publisher called eLife, and they’re looking for additional funding. So we’re working with both of them.
And in the first quarter of this year, Joe and I are working to formalize the process of how we work with these other projects and what we can offer them, and hopefully we’ll be in a position to take on additional projects later this year. But I really enjoy that work. And I think, as someone who went through the Mozilla Fellowship, which was like a 10 month long, crazy period where Mozilla invested a lot in me, making sure I was meeting people and learning how to write grants and learning how to give good talks, all kinds of awesome investment, I see that for a person who goes through a program like that, or a person who has a side project, there’s a need for groups in the space who can incubate those projects and help them as they develop from the incubator stage to the middle stage, before they scale up. So as a fiscal sponsor, we’re hoping to be able to support projects in that space.
\nAnd digging into the Dat protocol itself, when I was looking through the documentation it mentioned that the actual protocol is agnostic to the implementation, and I know that the current reference implementation is done in JavaScript. So I’m wondering if you could describe a bit about how the protocol itself is designed, how the reference implementation is done, how the overall protocol has evolved since it was first started, and what your approach is to versioning the protocol itself to ensure that people who are implementing it in other technologies or formats are able to ensure that they’re compliant with specific versions of the protocol as it evolves.
\nYeah, so Dat is basically a combination of ideas from Git, BitTorrent, and just the web in general. And there are a few key properties in Dat that basically any implementation has to recreate. Those are content integrity, decentralized mirroring of the data sets, network privacy, incremental versioning, and then random access to the data. We have a white paper that explains all of these in depth, but I’ll explain how they work in a basic use case. So let’s say I want to send some data to Danielle, which I do all the time, and I have a spreadsheet where I keep track of my coffee intake. I want it to live on Danielle’s computer so she can make sure I’m not over caffeinating myself. So, similar to how you get started with Git, I would put my spreadsheet in a folder and create a new dat. And whenever I create a new dat, it makes a new key pair: one public key and one private key. The public key is basically the dat link, so kind of like a URL. You can use that in anything that speaks the Dat protocol, and you can just open that up and look at all the files inside of that dat. And then the private key allows me to write files to the dat, and it’s used to sign any of the new changes. And so the private key allows Danielle to verify that the changes actually came from me, and that somebody else wasn’t trying to fake my data, or somebody wasn’t trying to man in the middle my data when I was transferring it to Danielle. So I add my spreadsheet to the dat, and what Dat does is break that file into little chunks, hash all those chunks, and create a Merkle tree with them. And that Merkle tree basically has lots of cool properties and is one of the key features of Dat. So the Merkle tree allows us to sparsely replicate data: if we had a really big data set and you only want one file, we can use the Merkle tree to download that one file and then still verify the integrity of that content with an incomplete data set. And the other part that allows us to do that is the register. So all the files are stored in one register, and all the metadata is stored in another register. These registers are basically append only ledgers; they’re also sort of known as secure registers. Google has a project called Certificate Transparency that has similar ideas. And with these registers, whenever a file changes, you append that to the metadata register, and that register stores basic information about the structure of the file system, what version it is, and then any other metadata, like the creation time or the change time of that file. And so right now, as you said, Tobias, we are very flexible on how things are implemented, but right now we basically store the files as files. That allows people to see the files normally and interact with them normally. But the cool part is that the on disk file storage can be really flexible. As long as the implementation has random access, basically, it can store things in any different way. So we have, for example, a storage model built for the server that stores all of the files as a single file. That allows you to have fewer file descriptors open, and gets the file I/O all constrained to one file.
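\nTo make the content integrity piece concrete, here is a minimal sketch of the idea Joe describes: chunk a file, hash the chunks into a Merkle tree, and sign the root with the dat’s private key so a reader can verify that an update came from the writer. This is only an illustration in Node.js, not Dat’s actual chunking, hashing, or wire format, which the white paper specifies in detail.

```javascript
// Illustrative only: the real Dat implementation uses different chunking,
// hashing, and signing details (see the Dat white paper).
const crypto = require('crypto')

const sha256 = (buf) => crypto.createHash('sha256').update(buf).digest()

// Split file contents into fixed-size chunks.
function chunk (buf, size = 64 * 1024) {
  const out = []
  for (let i = 0; i < buf.length; i += size) out.push(buf.slice(i, i + size))
  return out
}

// Fold chunk hashes pairwise up to a single Merkle root. Keeping the
// intermediate hashes around is what lets a peer verify one chunk
// without the whole file, which enables sparse replication.
function merkleRoot (hashes) {
  if (hashes.length === 1) return hashes[0]
  const next = []
  for (let i = 0; i < hashes.length; i += 2) {
    next.push(sha256(Buffer.concat([hashes[i], hashes[i + 1] || hashes[i]])))
  }
  return merkleRoot(next)
}

// The coffee spreadsheet from the example.
const file = Buffer.from('date,cups\n2018-02-01,3\n')
const root = merkleRoot(chunk(file).map(sha256))

// A signing key pair stands in for the dat's keys: the public key acts as
// the dat link, and the private key signs each new version of the registers.
const { publicKey, privateKey } = crypto.generateKeyPairSync('ed25519')
const signature = crypto.sign(null, root, privateKey)
console.log('update verified:', crypto.verify(null, root, publicKey, signature))
```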
So once my file gets added, I can share my link privately with Danielle. I can send that over chat or something, or just paste it somewhere. And then she can clone my dat using our command line tool or the desktop tool or the Beaker browser. And when she clones my dat, our computers basically connect directly to each other. We use a variety of mechanisms to try and make that connection; that’s been one of the challenges that I can talk about later, how to connect peer to peer and the challenges around that. But then once we do connect, we’ll transfer the data either over TCP or UDP. Those are the default network protocols that we use right now, but it can be implemented basically on any other protocol. I think Mathias once said that if you could implement it over carrier pigeon, that would work fine, as long as you had a lot of pigeons. So we’re really open to how the data, as far as the protocol goes, gets transferred. And we’re working on a Dat over HTTP implementation too. So this wouldn’t be peer to peer, but it would allow a traditional server fallback if no peers are online, or for services that don’t want to run peer to peer for whatever reason. Once Danielle clones my dat, she can open it just like a normal file and plug it into R or Python or whatever, and use her equation to measure my caffeine level. And then let’s say I drink another cup of coffee and update my spreadsheet: the changes will basically automatically be synced to her, as long as she’s still connected to me, and they will be synced throughout the network to anybody else that’s connected to me. So the metadata register stores that updated file information, and then the content register stores just the changed file blocks. So Danielle only has to sync the diff of that content change, rather than the whole data set again, which is really useful for big data sets. And yeah, we’ve had to design basically each of these pieces to be as modular as possible, both within our JavaScript implementation, but also in the protocol in general. So right now, developers can swap out the network protocols and the data storage. For example, if you want to use Dat in the browser, you can use WebRTC for the network and discovery, and then use IndexedDB for data storage. IndexedDB has random access, so you can just plug that in directly, and we have some modules for those, and that should be working. We did have a WebRTC implementation we were supporting for a while, but we found it a bit inconsistent for our use cases, which are more around large file sharing. But it still might be okay for chat and other more text based things. So, yeah, all of our implementations are in Node right now.
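\nFor a sense of what that modularity looks like in practice, here is a small sketch against the JavaScript reference modules, using the hyperdrive API as it stood around the time of this recording; treat the exact calls as illustrative rather than authoritative. Because storage is pluggable, the same code can target disk, memory, or IndexedDB in the browser:

```javascript
// Sketch of creating, writing, and replicating a dat with the JS reference
// modules (hyperdrive v9-era API; details may have changed since).
var hyperdrive = require('hyperdrive')
var ram = require('random-access-memory') // swappable storage backend

var archive = hyperdrive(ram) // could be a disk path or IndexedDB instead
archive.ready(function () {
  // The public key is the shareable dat link.
  console.log('dat://' + archive.key.toString('hex'))

  archive.writeFile('/coffee.csv', 'date,cups\n2018-02-01,3\n', function (err) {
    if (err) throw err

    // A reader clones by key and replicates over any duplex stream:
    // TCP, UDP-based transports, or anything else you can pipe through.
    var clone = hyperdrive(ram, archive.key)
    var stream = clone.replicate({ live: true })
    stream.pipe(archive.replicate({ live: true })).pipe(stream)

    clone.readFile('/coffee.csv', 'utf-8', function (err, data) {
      if (err) throw err
      console.log('synced:', data) // later updates only move the changed blocks
    })
  })
})
```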
\nI think that was both for usability and developer friendliness, and also just being able to work in the browser and across platforms. So we can distribute a binary of Dat pretty easily now, and you can run Dat in the browser or build Dat tools on Electron. So it allows a wide range of developer tools to be built on top of Dat. But we have a few community members now working on different implementations; Rust and C, I think, are the two that are going right now. And as far as the protocol versioning, that was actually one of the big conversations we were having in the last working group meeting, and that’s to be decided, basically. Through the stages we’ve gone through, we’ve broken it quite a few times, and now we’re finally in a place where we want to make sure not to break it moving forward. There’s space in the protocol for information like version history, or the version of the protocol, so we’ll probably use that to signal the version, and just figure out how the tools that are implementing it can fall back to the latest version. Before all the file based stuff, Dat went through a few different stages. It started really as more of a versioned, decentralized database, and then as Max and Mathias and Karissa moved to the scientific use cases, they removed more and more of the database architecture as it moved on and matured. That transition was really driven by user feedback and watching researchers work. We realized that so much of research data is still kept in files and basically moved manually between machines. So even if we were going to build some special database, a lot of researchers still wouldn’t be able to use it, because that requires more infrastructure than they have time to support. So we really just kept working to build a general purpose solution that allows other people to build tools to solve those more specific problems. And the last point is that right now, all dat transfer is basically one way, so only one person can update the source. This is really useful for a lot of our research use cases, where they’re getting data from lab equipment and there’s a specific source, and you just want to disseminate that information to various computers. But it really doesn’t work for collaboration. So that’s the next thing that we’re working on, but we really want to make sure to solve this one way problem before we move to the harder problem of collaborative data sets. And this last major iteration is sort of the hardest, and that’s what we’re working on right now. It allows multiple users to write to the same dat, and with that we get into problems like conflict resolution and duplicate updates and other harder distributed computing problems.
\nAnd that partially answers one of the next questions I had, which was to ask about conflict resolution. But if there’s only one source that’s allowed to update the information, then that solves a lot of the problems that might arise by syncing all these data sets between multiple machines, because there aren’t going to be multiple parties changing the data concurrently, so you don’t have to worry about how to handle those use cases. And another question that I had from what you were talking about is the cryptography aspect of it. It sounds as though when you initialize the dat, it just automatically generates the public and private keys, and so that private key is cryptographically linked with that particular data set. But is there any way to use, for instance, Keybase or GPG to sign the source dat, in addition to the generated key, to establish your identity for when you’re trying to share that information publicly, and not necessarily via some channel that already has established trust?
\nYeah, I mean, you could do that within the dat. We don’t really have any mechanism for doing that on top of Dat, so we’re sort of going to throw that into user land right now. But yeah, I mean, that’s a good question. And we’ve had some people, I think, experimenting with different identity systems and how to solve that problem. And I think we’re pretty excited about the new Wire app, because it’s open source, it uses end to end encryption, and it has some identity system, and we’re sort of trying to see if we can build that on top of Wire. So that’s one of the things that we’re experimenting with.
\nAnd one of the primary use cases that is mentioned in the documentation and the website for Dat is being able to host and distribute open data sets, with a focus on researchers and academic use cases. So I’m wondering if you can talk some more about how Dat helps with that particular effort, and what improvements it offers over some of the existing solutions that researchers were using before.
\nThere are solutions for both hosting and distributing data. In terms of hosting and distribution, there’s a lot of great work focused on data publication and making sure that data associated with publications is available online; I’m thinking about Zenodo and Dryad or Dataverse. There are also other data hosting platforms such as CKAN or data.world. And we really love the work these people do, and we’ve collaborated with some of them or been involved with them in friendly organizations; for example, the open source Alliance for Open Scholarship has some people from Dryad who are involved in it. And so it’s nice to work with them, and we’d love to work with them to use Dat to upload and distribute data. But right now, if researchers need to share files between many machines and keep them updated and versioned, so for example if there’s a large, live updating data set, there really aren’t great solutions to address data versioning and sharing. In terms of sharing, lots of researchers still manually copy files between machines and servers, or use tools like rsync or FTP, which is how I handled it during my PhD. Other software such as Globus or even Dropbox can require more IT infrastructure than a small research group may have. Researchers, you know, are all operating on limited grant funding, and they also depend on the IT structure of their institution to get them access to certain things. So a researcher like me might spend all day collecting a terabyte of data on a microscope, and then wait for hours, or wait overnight, to move it to another location. The ideal situation from a data management perspective is that those raw data are automatically archived to a web server and sent to the researcher’s computer for processing, so you have an archived copy of the raw data that came off of the equipment. And in the process, files also need to be archived, so you need archives of the imaging files, in this case at each step in processing. And then when a publication is ready, for the data processing pipeline to be fully reproducible, you’ll need the code and you’ll need the data at different stages. And even without access to the computer or the cluster where the analysis was done, a person should be able to repeat it. And I say ideally, because this isn’t really how it’s happening now.
\nAs for archiving data at the different steps, some of the things that stop that from happening are just the cost of storage, the availability of storage, and researcher habits. I definitely, you know, know some researchers who kept data on hard drives in Tupperware to protect them in case the sprinklers ever went off, which isn’t really a long term solution. True facts. So Dat can automate these archiving steps at different checkpoints and make the backups easier for researchers. As a former researcher, I’m interested in anything that makes better data management automatic for researchers. And so we’re also interested in versioned compute environments, to help labs avoid the drawer full of Jaz drives problem, which is sadly a quote from a senior scientist who was describing a bunch of data collected by her lab that she can no longer access. She has the drawer, she has the Jaz drives, she can’t get into them; that data is essentially lost. And so researchers are really motivated to make sure when things are archived, they’re archived in a form where they can actually be accessed. But I think, because researchers are so busy, it’s really hard to know when that is. So because we’re so focused on essentially filling in the gaps between the services that researchers use and that work well for them, and automating things, I think that Dat’s in a really good position to solve some of these problems. For example, one of the researchers we’re working with now has a large data set and a bioinformatics pipeline, and he’s at a UC lab, and he wants to get all the information to his collaborator right here in Washington State. And it’s taken months, and he has not been able to do it; he just can’t move that data across institutional lines. And that’s a much longer conversation as to why exactly that isn’t working. But we’re working with him to try to make it possible for him to move the data and create a versioned emulation of his compute environment, so that his collaborator can just do what he was doing and not need to spend four months worrying about dependencies and stuff. So yeah, hopefully that answers the question.
\nAnd one of the other difficult aspects of building a peer to peer protocol is the fact that in order for there to be sufficient value in the protocol itself, there needs to be a network behind it of people to share that information with, and to share the bandwidth requirements for distributing it. So I’m wondering how you have approached the effort of building up that network, and how much progress you feel you have made in that effort?
\nYeah, I’m not sure we really view Dat as a traditional peer to peer protocol, in the sense of relying on network effects to scale. As Danielle said, we’re just trying to get data from A to B, so our critical mass is basically two users on a given data set. Obviously, we want to first build something that offers better tools for those two users than the traditional cloud or client server model. So if I’m transferring files to another researcher using Dropbox, we have to transfer files via a third party and a third computer before they can get to the other computer. Rather than going direct between two computers, we have to go through a detour, and this has implications for speed, but also security, bandwidth usage, and even something like energy usage. So by cutting out that third computer, we feel like we’re already adding value to the network. We’re sort of hoping that researchers who are doing this HTTP transfer can see the value of going direct, and of using something that is versioned and can be live synced, over existing tools like rsync, or the commercial services that might store data in the cloud. And you know, we really don’t have anything against the centralized services; we recognize that they’re very useful sometimes, but they also aren’t the answer to everything. So depending on the use case, a decentralized system might make more sense than a centralized one, and we want to offer developers and users the option to make that choice, which we don’t really have right now. But in order to do that, we really have to start with peer to peer tools first. Once we have that decentralized network, we can basically limit the network to one server peer and many clients, and then all of a sudden it’s centralized. So we understand that it’s easy to go from decentralized to centralized, but it’s harder to go the other way around; we sort of have to start with a peer to peer network in order to solve all these different problems. And the other thing is that we know file systems are not going away. We know that web browsers will continue to support static files. And we also know that people will basically want to move these things between computers, back them up, archive them, share them to different computers. So we know files are going to be transferred a lot in the future, and that’s something we can depend on. And people probably even want to do this in a secure way sometimes, and maybe in an offline environment or a local network. And so we’re basically trying to build from those basic principles, using peer to peer transfer as the bedrock of all that. And that’s sort of how we got to where we are now with the peer to peer network. But we’re not really worried that we need a certain number or critical mass of users to add value, because we feel like by building the right tools with these principles, we can start adding value, whether it’s a decentralized network or a centralized network.
\nAnd one of the other use cases that’s been built on top of Dat is being able to build websites and applications that can be viewed in web browsers and distributed peer to peer in that manner. So I’m wondering how much uptake you’ve seen in usage for that particular application of the protocol, and how much development effort is being focused on that particular use case?
\nYeah, so if I open my Beaker browser right now, which is the main web implementation we have, that Paul Frazee and Tara Vancil are working on, I think I usually have 50 to 100, or sometimes 200, peers that I connect to right away. So that’s through some of the social network apps, like Rotonde and Fritter, and then just some personal sites. And you know, we’ve been working with the Beaker browser folks probably for two years now, sort of co-developing the protocol and seeing what they need support for in Beaker. But it comes back to that basic principle, that we recognize that a lot of websites are static files, and if we can just support static files in the best way possible, then you can browse a lot of websites. And that even gives you the benefit, for things that are more interactive, that they have to be developed so they work offline, too. So both Rotonde and Twitter can work offline, and then once you get back online, you can just sync the data sort of seamlessly. So that’s sort of the most exciting part about those.
\nYou mean Fritter, not Twitter.
\nFritter is the Twitter clone that Tara Vancil and Paul made. Beaker’s a lot of fun, and if you’ve never played around with it, I would encourage you to download it. I think it’s just beakerbrowser.com. I’m not a developer by trade, but I have seriously enjoyed playing around in Beaker. And I think some of the more frivolous things like Fritter that have come out of it are a lot of fun, and really speak to the potential of peer to peer networks in today’s era, as people are becoming increasingly frustrated with the centralized platforms.
\nAnd given that the content that’s being distributed via Dat using the browser is primarily static in nature, I’m wondering how that affects the sort of architectural patterns that people are used to using with the common three tier architecture. And you’ve already mentioned a couple of social network applications that have been built on top of it, but I’m wondering if there are any others that are built on top of and delivered via Dat that you’re aware of that you could talk about, that speak to some of the ways that people are taking advantage of Dat in more of the consumer space?
\nYeah, I mean, I think one of the big shifts that have made this easier is having databases in the browser, so things like IndexedDB or other local storage databases, and then being able to sync those to other computers. So I think people are trying to build games off this. You could build a chess game where I write to my local database, and then you have some logic for determining if a move is valid or not, and then sync that to your competitor. It’s a more constrained environment, but I think that also gives you the benefit of being able to constrain your development, and not requiring these external services or external database calls or whatever. I know that I’ve tried a few times to develop projects, just fun little things, and it is a challenge, because you sort of have to think differently about how those things work, and you can’t necessarily rely on external services, whether that’s something as simple as loading fonts from an external service, or CSS styles, or external JavaScript. You sort of want that all to be packaged within one dat if you want to ensure it’s all going to work. So you do have to think a little differently, even on those simple things. But yeah, it does constrain the sort of bigger applications. And I think the other area where we could see development is more in Electron applications. So maybe not in Beaker, but using Electron as a platform for other types of applications that might need those more flexible models. Science Fair, which is one of our hosted projects, is a really good example of how to use Dat in a way to distribute data, but still have a full application. So basically, you can distribute all the data for the application over Dat and keep it updated through the live syncing, and users can download the PDFs that they need to read, or the journals or the figures they want. Just download whatever they want, sort of allowing developers to have that flexible model where you can distribute things peer to peer and have both the live syncing, but also just downloading whatever data the users need, and just providing that framework for that data management.
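\nTo sketch the pattern Joe is describing, here is roughly what a turn-based game could look like inside a dat-aware browser. The DatArchive names below are from Beaker’s web API as I recall it around this period, and the file layout, OPPONENT_URL, and game logic are hypothetical stand-ins:

```javascript
// Local-first sketch: each player writes moves only to their own dat
// (they hold its private key) and reads the opponent's dat as it syncs.
const OPPONENT_URL = 'dat://<their-64-char-key>' // hypothetical placeholder

const mine = new DatArchive(window.location) // my writable archive
const theirs = new DatArchive(OPPONENT_URL)  // opponent's archive, read-only

async function makeMove (move) {
  let moves = []
  try { moves = JSON.parse(await mine.readFile('/moves.json')) } catch (e) {}
  moves.push(move)
  // The write is signed with my key and live-synced to any connected peers.
  await mine.writeFile('/moves.json', JSON.stringify(moves))
}

// Hypothetical move-validation and rendering logic.
function renderIfValid (moves) { /* game-specific checks go here */ }

// No server round trip: poll the opponent's replica as it live-syncs.
setInterval(async () => {
  try {
    renderIfValid(JSON.parse(await theirs.readFile('/moves.json')))
  } catch (e) { /* opponent hasn't moved yet */ }
}, 1000)
```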
\nAnd one of the other challenges that’s posed, particularly for this public distribution use case, is content discovery, because by default the URLs that are generated are private and unguessable, since they’re essentially just hashes of the content. So I’m wondering if there are any particular mechanisms that you either have built or planned or started discussing for being able to facilitate content discovery of the information that’s being distributed by these different networks?
\nYeah, this is definitely an open question. I sort of fall back on my common answer, which is that it depends on the tool and the different communities, and there are going to be different approaches; some might be more decentralized, and some might be centralized. So, for example, with data set discovery, there are a lot of good centralized services for data set publishing, as Danielle mentioned, like Zenodo or Dataverse. These are places that already have discovery engines, I guess we’ll say, and they publish data sets. So you could similarly publish the dat URL along with those data sets, so that people could have an alternative way to download them. That’s one way that we’ve been thinking about discovery: leveraging these existing solutions that are doing a really good job in their domain, and trying to work with them to start using Dat for their data management. Another sort of hacky solution, I guess I’ll say, is using existing domains and DNS. So basically, you can publish a regular HTTP site on your URL and give it a specific well known file that points to your dat address, and then the Beaker browser can find that file and tell you that a peer to peer version of that site is available. So we’re basically leveraging the existing DNS infrastructure to start to discover content, just with existing URLs. And I think a lot of the discovery will be more community based. So, for example, in Fritter and Rotonde, people are starting to build crawlers or search bots to discover users or search. Basically, we’re just looking at where there is need, and identifying different types of crawlers to build and how to connect those communities in different ways. So we’re really excited to see what ideas pop up in that area, and they’ll probably come in a decentralized way, we hope.
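\nFor reference, that DNS shortcut amounts to serving a small text file over HTTPS at a well known path. It looked something like the following under the convention Beaker used around this time; the key is a placeholder, and the exact format is best checked against the current dat DNS spec:

```
# Served at https://example.com/.well-known/dat
dat://<64-character-hex-public-key>
TTL=3600
```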
\nAnd for somebody who wants to start using Dat, what is involved in creating and/or consuming the content that’s available on the network? And are there any particular resources available to get somebody up to speed on how it works and some of the different uses that they could put it to?
\nSure, I can take that, and Joe, just chime in if you think of anything else. We built a tutorial for our work with the labs and for MozFest this year; that’s at try-dat.com. And this tutorial takes you through how to work with the command line tool and some basics about Beaker. And please tell us if you find a bug; there may be bugs remaining, but it was working pretty well when I used it last. And it’s in the browser, and it spins up a little virtual machine, so you can either share data with yourself, or you can do it with a friend and share data with your friend. Beaker is also super easy for a user who wants to get started. You can visit pages over dat just like you would a normal web page. For example, you can go to this website, and we’ll give Tobias the link to that, and just change the http to dat, so it looks like dat://jhand.space. And Beaker also has this fun thing that lets you create a new site with a single click. You can also fork sites and edit them and make your own copies of things, which is fun if you’re learning about how to build simple websites. So you can go to beakerbrowser.com and learn about that. And I think we’ve already talked about Rotonde and Fritter, and we’ll add links for people who want to learn more about those. And then for data focused users, you can use Dat for sharing or transferring files, either with the desktop application or the command line interface. So if you’re interested, we encourage you to play around. The community is really friendly and helpful to new people. Joe and I are always on the IRC channel or on Twitter, so if you have questions, feel free to ask, and we love talking to new people, because that’s how all the exciting stuff happens in this community.
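\nFor the command line route Danielle mentions, the basic workflow with the dat CLI around the time of this recording looked roughly like the following; command names are as I recall them, so check `dat help` for the current set:

```
# Share a folder: prints a dat:// link and keeps seeding while it runs
dat share ~/my-data

# On another machine: clone the dat by its link, then keep it live-synced
dat clone dat://<64-character-key> ~/my-data
dat sync ~/my-data
```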
\nAnd what have been some of the most challenging aspects of building the project and the community, and promoting the use cases and capabilities of the project?
\nI can speak a little bit to promoting it in academic research. So in academic research, probably similar to many of the industries where your listeners work, software decisions are not always made for entirely rational reasons. There’s tension between what your boss wants, what the IT department has approved, what meets institutional data security needs, and then the perceived time cost of developing a new workflow and getting used to a new protocol. So we try to work directly with researchers to make sure the things we build are easy and secure. But it is a lot of promotion and outreach to get scientists to try a new workflow. They’re really busy, and the incentives are all, you know, get more grants, do more projects, publish more papers. And so even if something will eventually make your life easier, it’s hard to sink in time up front. One thing I noticed, and this is probably common to all industries, is that I’ll be talking to someone and they’ll say, oh, you know, archiving the data from my research group is not a problem for me, and then they’ll proceed to describe a super problematic data management workflow. And it’s not a problem for them anymore, because they’re used to it, so it doesn’t hurt day to day. But, you know, doing things like waiting until the point of publication and then trying to go back and archive all the raw data, where maybe some of it was collected by a postdoc who’s now gone, and other data was collected by a summer student who used a non standard naming scheme for all the files. There are just a million ways that that stuff can go wrong. So for now, we’re focusing on developing real world use cases, and participating in community education around data management. And we want to build stuff that’s meaningful for researchers and others who work with data, and we think that working with people and doing the nonprofit thing with grants is going to be the way to get us there. Joe, do you want to talk a little bit about building?
\nYeah, sure. So in terms of building it, I haven’t done too much work on the core protocol, so I can’t say much about the difficult design decisions there. I’m the main developer on the command line tool, and most of the challenging decisions there are about user interfaces, not necessarily technical problems. So as Danielle said, it’s as much about people as it is about software and those decisions. But I think one of the most challenging things that we’ve run into a lot is basically network issues. So in a peer to peer network, you have to figure out how to connect peers directly on networks where they might not be supposed to do that. I think a lot of that comes from BitTorrent making different institutions restrict peer to peer networking in different ways. And so we’re sort of having to fight that battle against these existing restrictions, and trying to find out how these networks are restricted and how we can continue to have success in connecting peers directly, rather than through a third party server. And it’s funny, or maybe not funny, but some of the strictest networks we’ve found are actually in academic institutions. For example, at one of the UC campuses, I think we found out that computers can never connect directly to other computers on that same network. So if we wanted to transfer data between two computers sitting right next to each other, we would basically have to go through an external cloud server just to get it to the computer sitting right next to us, or, you know, use something like a hard drive or a thumb drive or whatever. That sort of thing, all these different network configurations, is one of the hardest parts, both in terms of implementation, but also in terms of testing, since we can’t readily get into these UC campuses to see what the network setup is. So we’re trying to create more tools around network testing, both testing networks in the wild, but also just using virtual networks to test different types of network setups, and leveraging those two things combined to try and get around all these network connection issues. So yeah, I think, you know, I would love to have Mathias answer this question around the design decisions in terms of the core protocol, but I can’t really say much about that, unfortunately.
\nAnd are there any particularly interesting or inspiring uses of it that you’re aware of and would like to share?
\nSure, I can share a couple of things that we were involved in. In January 2016 we were involved in the Data Rescue and Libraries+ Network community, the movement to archive government funded research at trusted public institutions like libraries and archives. As a part of that, we got to work with some of the really awesome people at the California Digital Library. The California Digital Library is really cool because it is a digital library with a mandate to preserve, archive, and steward the data that’s produced in the UC system, and it supports the entire UC system, and the people are great. So we worked with them to make the first ever backup of data.gov in January of 2016, and I think my colleague had 40 terabytes of metadata sitting in his living room for a while as we were working up to the transfer. That was a really cool project, and it produced a useful thing, and we got to work with some of the data.gov people to make it happen. They were like, really, it has never been backed up? So it was a good time to do it. But believe it or not, it’s actually pretty hard to find funding for that work, and we have more work we’d like to do in that space. Archiving copies of federally funded research at trusted institutions is a really critical step towards ensuring the long term preservation of the research that gets done in this country. So hopefully 2018 will see those projects funded, or new collaborations in that space. It’s also a fantastic community, with a lot of really interesting librarians and archivists who have great perspective on long term data preservation, and I love working with them, so hopefully we can do something else there. The other thing that I’m really excited about is the Dat in the Lab project, where we’re working on the Dat container issue. I know we’re running a little over time, so I don’t know how much I should go into this, but we’ve learned a lot about really interesting research. We’re working to develop a container based simulation of a research computing cluster that can run on any machine or in the cloud. By creating a container that includes the complete software environment of the cluster, researchers across the UC system can quickly make the analysis pipelines they’re working on usable in other locations. And this, believe it or not, is a big problem. I was surprised when one researcher told me she had been working for four months to get a pipeline running at UC Merced that had been developed at UCLA. You could drive back and forth between Merced and UCLA a bunch of times in four months, but it’s this little stuff that really slows research down. So I’m really excited about the potential there. We’ve written a couple of blog posts on that, so I can add the links to those in the follow up.
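\nAs a rough illustration of the container idea, and assuming Docker as the runtime, the sketch below pins a pipeline’s software environment to an image and runs the same analysis unchanged on a laptop, a cluster node, or a cloud VM. The image name and script path are hypothetical placeholders, not anything from the project itself.

```python
import subprocess
import sys

# Hypothetical image capturing the cluster's complete software
# environment (compilers, libraries, tool versions).
IMAGE = "example.org/research-cluster-env:2018.01"

def run_pipeline(data_dir: str, script: str = "run_pipeline.sh") -> int:
    """Run an analysis script inside the pinned environment.

    Bind-mounting the data directory while freezing the software in
    the image is what makes the pipeline portable: the host machine
    only needs a container runtime, not the cluster's software stack.
    """
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:/data",  # mount input/output data into the container
        IMAGE,
        "bash", script,
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    sys.exit(run_pipeline(sys.argv[1] if len(sys.argv) > 1 else "."))
```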
\nAnd I’d say the most novel use that I’m excited about is called Hypervision, which is basically video streaming built on Dat. Mathias Buus, one of the lead developers, is prototyping something similar with Danish public TV, where they basically want to live stream their channels over the peer to peer network. I’m excited about that because I’d really love to get more public television and public radio distributing content peer to peer, so we can reduce their infrastructure costs and hopefully allow more of that great content to come out.
\nAre there any other topics that we haven’t discussed yet that you think we should talk about before we close out the show?
\nUm, I think I’m feeling pretty good. What about you, Joe?
\nYeah, I think that’s it for me. Okay.
\nSo for anybody who wants to keep up to date with the work you’re doing or get in touch, we’ll have you each add your preferred contact information to the show notes. And as a final question, to give the listeners something else to think about: from your perspective, what is the biggest gap in the tooling or technology that’s available for data management today?
\nI’d say transferring files, which feels really funny to say, but to me it’s still a problem that’s not really well solved. Just, how do you get files from A to B in a consistent and easy to use manner? Especially if you want a solution that doesn’t require a command line, is still secure, and hopefully doesn’t go through a third party service, because that also means it can work offline. A lot of what I saw in the developing world is the need for data management that works offline, and I think that’s one of the biggest gaps that we don’t really address yet. There are a lot of great data management tools out there, but I think they’re aimed more at data scientists or software focused users who might use managed databases or something like Hadoop. There’s really a ton of users out there who don’t have tools. Most of the world is still offline or has inconsistent internet, and putting everything through servers in the cloud isn’t really feasible, but the alternatives now require careful, manual data management if you don’t want to lose all your data. So we really hope to find a good balance between those two needs and those two use cases.
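\nOne small but essential piece of getting files from A to B without a third party, whether over a network or on a thumb drive, is verifying that what arrived is what was sent. Here is a minimal sketch in Python using content hashing, the same basic idea that content-addressed tools build on; the file paths are placeholders.

```python
import hashlib

def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading in 1 MiB chunks
    so arbitrarily large files can be hashed in constant memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def verify_transfer(source: str, destination: str) -> bool:
    """Return True when the destination copy matches the source exactly.

    Comparing digests works the same whether the copy traveled over a
    peer-to-peer connection or was carried across the room on a drive,
    which is what makes it useful for offline workflows.
    """
    return file_digest(source) == file_digest(destination)

if __name__ == "__main__":
    # Placeholder paths, for illustration.
    print(verify_transfer("data/raw.csv", "/mnt/usb/raw.csv"))
```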
\nPlus one to what Joe said: transferring files. It does feel funny to say, but it is still a problem in a lot of industries, and especially where I come from in research science. From my perspective, the other issue is that the people problems are always as hard as or harder than the technical problems. If people don’t think it’s important to share data or archive data in an accessible and usable form, we could have the world’s best easy to use tool and it wouldn’t impact the landscape or the accessibility of data. Similarly, if people are sharing data that’s not usable, because it’s missing experimental context, or it’s in a proprietary format, or it’s shared under a restrictive license, it’s also not going to impact the landscape or be useful to the scientific community or the public. So we want to build great tools, but I also want to work to change the incentive structure in research to ensure that good data management practices are rewarded and that data is shared in a usable form. That’s really key. I’ll add a link in the show notes to the FAIR data principles, which say data should be findable, accessible, interoperable, and reusable, something your listeners might want to check out if they’re not familiar with it. It’s a framework developed in academia, but I’m not sure how much impact it’s had outside of that sphere, so it would be interesting to share it with your listeners. And yeah, I’ll put my contact info in the show notes, and I’d love to connect with anyone and answer any further questions about this and what we’re going to try to do with Code for Science & Society over the next year. So thanks a lot, Tobias, for inviting us.
\nYeah, absolutely. Thank you both for taking the time out of your days to join me and talk about the work you’re doing. It’s definitely a very interesting project with a lot of useful potential, and I’m excited to see where you take it in the future. So thank you both for your time, and I hope you enjoy the rest of your evening.
\nThank you. Thank you.
\nTranscribed by https://otter.ai
\n \nThe majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, and the rest of the sources of knowledge in a company are housed in so-called “Dark Data” sets. In this episode Alex Ratner explains how the work that he and his fellow researchers are doing on Snorkel can be used to extract value by leveraging labeling functions written by domain experts to generate training sets for machine learning models. He also explains how this approach can help democratize machine learning by making it feasible for organizations with smaller data sets than most tooling requires.
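\nTo make the labeling-function idea concrete, here is a minimal sketch in plain Python, deliberately not tied to any particular Snorkel release: each function encodes one noisy domain heuristic, votes a label or abstains, and the votes are combined into training labels. The spam-detection heuristics are invented purely for illustration.

```python
# Labels: 1 = spam, 0 = not spam, -1 = abstain (no opinion).
SPAM, HAM, ABSTAIN = 1, 0, -1

# Each labeling function encodes one noisy heuristic from a domain expert.
def lf_contains_link(text: str) -> int:
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_all_caps(text: str) -> int:
    words = text.split()
    return SPAM if words and all(w.isupper() for w in words) else ABSTAIN

def lf_short_greeting(text: str) -> int:
    return HAM if len(text) < 40 and text.lower().startswith(("hi", "hello")) else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_all_caps, lf_short_greeting]

def weak_label(text: str) -> int:
    """Combine the noisy votes by majority; abstain when nothing fires.

    Snorkel replaces this naive majority vote with a generative model
    that learns each function's accuracy and correlations, but the
    pipeline shape is the same: many cheap heuristics in, one training
    label out.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

unlabeled = ["CLICK NOW https://spam.example", "hi, lunch tomorrow?"]
print([weak_label(t) for t in unlabeled])  # -> [1, 0]
```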
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Snorkel: Extracting Value From Dark Data With Python (Interview)","date_published":"2018-01-21T23:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/338aa66c-3ab5-43fe-8d58-cbbcc5b37a89.mp3","mime_type":"audio/mpeg","size_in_bytes":23528277,"duration_in_seconds":2232}]},{"id":"podlove-2018-01-15t01:07:25+00:00-64fbecf7a8498a6","title":"CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14","url":"https://www.dataengineeringpodcast.com/crdts-with-christopher-meiklejohn-episode-14","content_text":"Summary\n\nAs we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that all of the distributed nodes in our systems agree with each other we need to build mechanisms to properly handle replication of data and conflict resolution. In this episode Christopher Meiklejohn discusses the research he is doing with Conflict-Free Replicated Data Types (CRDTs) and how they fit in with existing methods for sharing and sharding data. He also shares resources for systems that leverage CRDTs, how you can incorporate them into your systems, and when they might not be the right solution. It is a fascinating and informative treatment of a topic that is becoming increasingly relevant in a data driven world.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Christopher Meiklejohn about establishing consensus in distributed systems\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nYou have dealt with CRDTs with your work in industry, as well as in your research. Can you start by explaining what a CRDT is, how you first began working with them, and some of their current manifestations?\nOther than CRDTs, what are some of the methods for establishing consensus across nodes in a system and how does increased scale affect their relative effectiveness?\nOne of the projects that you have been involved in which relies on CRDTs is LASP. 
Can you describe what LASP is and what your role in the project has been?\nCan you provide examples of some production systems or available tools that are leveraging CRDTs?\nIf someone wants to take advantage of CRDTs in their applications or data processing, what are the available off-the-shelf options, and what would be involved in implementing custom data types?\nWhat areas of research are you most excited about right now?\nGiven that you are currently working on your PhD, do you have any thoughts on the projects or industries that you would like to be involved in once your degree is completed?\n\n\nContact Info\n\n\nWebsite\ncmeiklejohn on GitHub\nGoogle Scholar Citations\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nBasho\nRiak\nSyncfree\nLASP\nCRDT\nMesosphere\nCAP Theorem\nCassandra\nDynamoDB\nBayou System (Xerox PARC)\nMultivalue Register\nPaxos\nRAFT\nByzantine Fault Tolerance\nTwo Phase Commit\nSpanner\nReactiveX\nTensorflow\nErlang\nDocker\nKubernetes\nErleans\nOrleans\nAtom Editor\nAutomerge\nMartin Klepman\nAkka\nDelta CRDTs\nAntidote DB\nKops\nEventual Consistency\nCausal Consistency\nACID Transactions\nJoe Hellerstein\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that all of the distributed nodes in our systems agree with each other we need to build mechanisms to properly handle replication of data and conflict resolution. In this episode Christopher Meiklejohn discusses the research he is doing with Conflict-Free Replicated Data Types (CRDTs) and how they fit in with existing methods for sharing and sharding data. He also shares resources for systems that leverage CRDTs, how you can incorporate them into your systems, and when they might not be the right solution. It is a fascinating and informative treatment of a topic that is becoming increasingly relevant in a data driven world.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
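\nFor readers who want a feel for why CRDTs can merge without coordination, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs, in Python; the node names are arbitrary.

```python
class GCounter:
    """Grow-only counter CRDT: each node increments only its own slot,
    and merge takes the per-node maximum, so merging is commutative,
    associative, and idempotent -- replicas converge in any order."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())

# Two replicas accept writes independently, then sync in either order.
a, b = GCounter("node-a"), GCounter("node-b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value == b.value == 5
```

Richer types such as counters with decrements, sets, and registers build on the same merge idea, which is where the delta CRDTs and Antidote DB mentioned in the links pick up.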
","summary":"CRDTs, Conflict Resolution, and Distributed Consensus in Real World Systems (Interview)","date_published":"2018-01-14T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/61c189d7-9415-4806-91a1-9e3e2593f553.mp3","mime_type":"audio/mpeg","size_in_bytes":29084349,"duration_in_seconds":2743}]},{"id":"podlove-2018-01-08t01:01:23+00:00-487089bc2c64a77","title":"Citus Data: Distributed PostGreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13","url":"https://www.dataengineeringpodcast.com/citus-data-with-ozgun-erdogan-and-craig-kerstiens-episode-13","content_text":"Summary\n\nPostGreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostGreSQL, and how you can start using it in your environment.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Ozgun Erdogan and Craig Kerstiens about Citus, worry free PostGreSQL\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nCan you describe what Citus is and how the project got started?\nWhy did you start with Postgres vs. 
building something from the ground up?\nWhat was the reasoning behind converting Citus from a fork of PostGres to being an extension and releasing an open source version?\nHow well does Citus work with other Postgres extensions, such as PostGIS, PipelineDB, or Timescale?\nHow does Citus compare to options such as PostGres-XL or the Postgres compatible Aurora service from Amazon?\nHow does Citus operate under the covers to enable clustering and replication across multiple hosts?\nWhat are the failure modes of Citus and how does it handle loss of nodes in the cluster?\nFor someone who is interested in migrating to Citus, what is involved in getting it deployed and moving the data out of an existing system?\nHow do the different options for leveraging Citus compare to each other and how do you determine which features to release or withhold in the open source version?\nAre there any use cases that Citus enables which would be impractical to attempt in native Postgres?\nWhat have been some of the most challenging aspects of building the Citus extension?\nWhat are the situations where you would advise against using Citus?\nWhat are some of the most interesting or impressive uses of Citus that you have seen?\nWhat are some of the features that you have planned for future releases of Citus?\n\n\nContact Info\n\n\nCitus Data\n\ncitusdata.com\n@citusdata on Twitter\ncitusdata on GitHub\n\n\n\nCraig\n\n\nEmail\nWebsite\n@craigkerstiens on Twitter\n\n\n\nOzgun\n\n\nEmail\nozgune on GitHub\n\n\n\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nCitus Data\nPostGreSQL\nNoSQL\nTimescale SQL blog post\nPostGIS\nPostGreSQL Graph Database\nJSONB Data Type\nPipelineDB\nTimescale\nPostGres-XL\nAurora PostGres\nAmazon RDS\nStreaming Replication\nCitusMX\nCTE (Common Table Expression)\nHipMunk Citus Sharding Blog Post\nWal-e\nWal-g\nHeap Analytics\nHyperLogLog\nC-Store\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"PostGreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostGreSQL, and how you can start using it in your environment.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
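\nAs a concrete flavor of the extension approach, this sketch shows, via Python’s psycopg2, the Citus pattern of declaring a shard key on an ordinary Postgres table with the create_distributed_table() function that Citus provides; the connection string and table are placeholders, and a running Citus coordinator is assumed.

```python
import psycopg2

# Placeholder connection string; points at the Citus coordinator node.
conn = psycopg2.connect("postgresql://user:pass@coordinator.example:5432/app")
conn.autocommit = True

with conn.cursor() as cur:
    # An ordinary Postgres table...
    cur.execute("""
        CREATE TABLE events (
            user_id     bigint      NOT NULL,
            occurred_at timestamptz NOT NULL,
            payload     jsonb
        )
    """)
    # ...becomes a distributed table sharded by user_id, so queries that
    # filter on the shard key are routed to a single worker node while
    # analytical queries fan out across all of them in parallel.
    cur.execute("SELECT create_distributed_table('events', 'user_id')")
```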
","summary":"Scaling PostGreSQL for Big Data and Parallel Execution with Citus Data (Interview)","date_published":"2018-01-07T20:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/740992b7-24f5-4101-af9d-95ef879a51cd.mp3","mime_type":"audio/mpeg","size_in_bytes":35123320,"duration_in_seconds":2804}]},{"id":"podlove-2017-12-25t03:47:39+00:00-b66e6101d7f468c","title":"Wallaroo with Sean T. Allen - Episode 12","url":"https://www.dataengineeringpodcast.com/wallaroo-with-sean-t-allen-episode-12","content_text":"Summary\n\nData oriented applications that need to operate on large, fast-moving sterams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project. He explains the motivation for building Wallaroo, how it is implemented, and how you can start using it today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Sean T. 
Allen about Wallaroo, a framework for building and operating stateful data applications at scale\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data engineering?\nWhat is Wallaroo and how did the project get started?\nWhat is the Pony language, and what features does it have that make it well suited for the problem area that you are focusing on?\nWhy did you choose to focus first on Python as the language for interacting with Wallaroo and how is that integration implemented?\nHow is Wallaroo architected internally to allow for distributed state management?\n\nIs the state persistent, or is it only maintained long enough to complete the desired computation?\nIf so, what format do you use for long term storage of the data?\n\n\n\nWhat have been the most challenging aspects of building the Wallaroo platform?\nWhich axes of the CAP theorem have you optimized for?\nFor someone who wants to build an application on top of Wallaroo, what is involved in getting started?\nOnce you have a working application, what resources are necessary for deploying to production and what are the scaling factors?\n\n\nWhat are the failure modes that users of Wallaroo need to account for in their application or infrastructure?\n\n\n\nWhat are some situations or problem types for which Wallaroo would be the wrong choice?\nWhat are some of the most interesting or unexpected uses of Wallaroo that you have seen?\nWhat do you have planned for the future of Wallaroo?\n\n\nContact Info\n\n\nIRC\nMailing List\nWallaroo Labs Twitter\nEmail\nPersonal Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nWallaroo Labs\nStorm Applied\nApache Storm\nRisk Analysis\nPony Language\nErlang\nAkka\nTail Latency\nHigh Performance Computing\nPython\nApache Software Foundation\nBeyond Distributed Transactions: An Apostate’s View\nConsistent Hashing\nJepsen\nLineage Driven Fault Injection\nChaos Engineering\nQCon 2016 Talk\nCodemesh in London: How did I get here?\nCAP Theorem\nCRDT\nSync Free Project\nBasho\nWallaroo on GitHub\nDocker\nPuppet\nChef\nAnsible\nSaltStack\nKafka\nTCP\nDask\nData Engineering Episode About Dask\nBeowulf Cluster\nRedis\nFlink\nHaskell\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Data oriented applications that need to operate on large, fast-moving sterams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project. He explains the motivation for building Wallaroo, how it is implemented, and how you can start using it today.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Fast and Scalable Real-Time Stream Computation with Wallaroo (Interview)","date_published":"2017-12-24T23:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/93644a6e-a334-4986-9855-bd725443f137.mp3","mime_type":"audio/mpeg","size_in_bytes":38436390,"duration_in_seconds":3553}]},{"id":"podlove-2017-12-18t01:05:08+00:00-5cc4317fde052d2","title":"SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11","url":"https://www.dataengineeringpodcast.com/siridb-with-jeroen-van-der-heijden-episode-11","content_text":"Summary\n\nTime series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Jeroen van der Heijden about SiriDB, a next generation time series database \n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data engineering?\nWhat is SiriDB and how did the project get started?\n\nWhat was the inspiration for the name?\n\n\n\nWhat was the landscape of time series databases at the time that you first began work on Siri?\nHow does Siri compare to other time series databases such as InfluxDB, Timescale, KairosDB, etc.?\nWhat do you view as the competition for Siri?\nHow is the server architected and how has the design evolved over the time that you have been working on it?\nCan you describe how the clustering mechanism functions?\n\n\nIs it possible to create pools with more than two servers?\n\n\n\nWhat are the failure modes for SiriDB and where does it fall on the spectrum for the CAP theorem?\nIn the documentation it mentions needing to specify the retention period for the shards when creating a database. What is the reasoning for that and what happens to the individual metrics as they age beyond that time horizon?\nOne of the common difficulties when using a time series database in an operations context is the need for high cardinality of the metrics. 
How are metrics identified in Siri and is there any support for tagging?\nWhat have been the most challenging aspects of building Siri?\nIn what situations or environments would you advise against using Siri?\n\n\nContact Info\n\n\njoente on Github\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nSiriDB\nOversight\nInfluxDB\nLevelDB\nOpenTSDB\nTimescale DB\nKairosDB\nWrite Ahead Log\nGrafana\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"SiriDB: Scalable Timeseries Database For Your System Metrics (Interview)","date_published":"2017-12-17T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/51ad4bd7-bcc1-4113-ab81-01066da679dc.mp3","mime_type":"audio/mpeg","size_in_bytes":21279629,"duration_in_seconds":2032}]},{"id":"podlove-2017-12-10t14:18:54+00:00-d3d8c2983023413","title":"Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10","url":"https://www.dataengineeringpodcast.com/confluent-schema-registry-with-ewen-cheslack-postava-episode-10","content_text":"Summary\n\nTo process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it. He also discusses how it can be extended for other deployment targets and use cases, and additional features that are planned for future releases.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. 
Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Ewen Cheslack-Postava about the Confluent Schema Registry\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data engineering?\nWhat is the schema registry and what was the motivating factor for building it?\nIf you are using Avro, what benefits does the schema registry provide over and above the capabilities of Avro’s built in schemas?\nHow did you settle on Avro as the format to support and what would be involved in expanding that support to other serialization options?\nConversely, what would be involved in using a storage backend other than Kafka?\nWhat are some of the alternative technologies available for people who aren’t using Kafka in their infrastructure?\nWhat are some of the biggest challenges that you faced while designing and building the schema registry?\nWhat is the tipping point in terms of system scale or complexity when it makes sense to invest in a shared schema registry and what are the alternatives for smaller organizations?\nWhat are some of the features or enhancements that you have in mind for future work?\n\n\nContact Info\n\n\newencp on GitHub\nWebsite\n@ewencp on Twitter\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\nKafka\nConfluent\nSchema Registry\nSecond Life\nEve Online\nYes, Virginia, You Really Do Need a Schema Registry\nJSON-Schema\nParquet\nAvro\nThrift\nProtocol Buffers\nZookeeper\nKafka Connect\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it. He also discusses how it can be extended for other deployment targets and use cases, and additional features that are planned for future releases.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
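\nFor a sense of how clients interact with the registry, here is a small sketch against its REST interface in Python; the registry address, subject name, and Avro schema are illustrative placeholders.

```python
import json
import requests

REGISTRY = "http://schema-registry.example:8081"  # placeholder address
SUBJECT = "user-events-value"

# An Avro schema is registered as an escaped JSON string under a subject;
# the registry returns a unique id that producers embed in messages so
# consumers can fetch the exact writer schema later.
schema = {
    "type": "record",
    "name": "UserEvent",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "action", "type": "string"},
    ],
}
resp = requests.post(
    f"{REGISTRY}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(schema)}),
)
schema_id = resp.json()["id"]

# Consumers resolve the latest registered version for the subject.
latest = requests.get(f"{REGISTRY}/subjects/{SUBJECT}/versions/latest").json()
print(schema_id, latest["version"])
```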
","summary":"How Centralized Schemas Help Tame Distributed Streaming Analytics (Interview)","date_published":"2017-12-10T09:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/ea534908-dfac-4dff-a31e-306c1ad7ba87.mp3","mime_type":"audio/mpeg","size_in_bytes":31598353,"duration_in_seconds":2961}]},{"id":"podlove-2017-12-03t03:11:07+00:00-fc2c042995799e7","title":"data.world with Bryon Jacob - Episode 9","url":"https://www.dataengineeringpodcast.com/data-dot-world-with-bryon-jacob-episode-9","content_text":"Summary\n\nWe have tools and platforms for collaborating on software projects and linking them together, wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and private use that can be linked together to build a semantic web of information. The CTO, Bryon Jacob, discusses how the company got started, their mission, and how they have built and evolved their technical infrastructure.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nThis is your host Tobias Macey and today I’m interviewing Bryon Jacob about the technology and purpose that drive data.world\n\n\nInterview\n\n\nIntroduction\nHow did you first get involved in the area of data management?\nWhat is data.world and what is its mission and how does your status as a B Corporation tie into that?\nThe platform that you have built provides hosting for a large variety of data sizes and types. 
What does the technical infrastructure consist of and how has that architecture evolved from when you first launched?\nWhat are some of the scaling problems that you have had to deal with as the amount and variety of data that you host has increased?\nWhat are some of the technical challenges that you have been faced with that are unique to the task of hosting a heterogeneous assortment of data sets that intended for shared use?\nHow do you deal with issues of privacy or compliance associated with data sets that are submitted to the platform?\nWhat are some of the improvements or new capabilities that you are planning to implement as part of the data.world platform?\nWhat are the projects or companies that you consider to be your competitors?\nWhat are some of the most interesting or unexpected uses of the data.world platform that you are aware of?\n\n\nContact Information\n\n\n@bryonjacob on Twitter\nbryonjacob on GitHub\nLinkedIn\n\n\nParting Question\n\n\nFrom your perspective, what is the biggest gap in the tooling or technology for data management today?\n\n\nLinks\n\n\ndata.world\nHomeAway\nSemantic Web\nKnowledge Engineering\nOntology\nOpen Data\nRDF\nCSVW\nSPARQL\nDBPedia\nTriplestore\nHeader Dictionary Triples\nApache Jena\nTabula\nTableau Connector\nExcel Connector\nData For Democracy\nJonathan Morgan\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"We have tools and platforms for collaborating on software projects and linking them together, wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and private use that can be linked together to build a semantic web of information. The CTO, Bryon Jacob, discusses how the company got started, their mission, and how they have built and evolved their technical infrastructure.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"Data.World: The Platform For The Web Of Linked Data (Interview)","date_published":"2017-12-02T22:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/04c238c2-78a3-4d35-a9c3-a697fba495af.mp3","mime_type":"audio/mpeg","size_in_bytes":34176774,"duration_in_seconds":2784}]},{"id":"podlove-2017-11-22t11:22:25+00:00-d1acaf96838517e","title":"Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8","url":"https://www.dataengineeringpodcast.com/data-serialization-with-doug-cutting-and-julien-le-dem-episode-8","content_text":"Summary\nWith the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.\nPreamble\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nContinuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nThis is your host Tobias Macey and today I’m interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems.\n\nInterview\n\nIntroduction\nHow did you first get involved in the area of data management?\nWhat are the main serialization formats used for data storage and analysis?\nWhat are the tradeoffs that are offered by the different formats?\nHow have the different storage and analysis tools influenced the types of storage formats that are available?\nYou’ve each developed a new on-disk data format, Avro and Parquet respectively. 
What were your motivations for investing that time and effort?\nWhy is it important for data engineers to carefully consider the format in which they transfer their data between systems?\n\nWhat are the switching costs involved in moving from one format to another after you have started using it in a production system?\n\n\nWhat are some of the new or upcoming formats that you are each excited about?\nHow do you anticipate the evolving hardware, patterns, and tools for processing data to influence the types of storage formats that maintain or grow their popularity?\n\nContact Information\n\nDoug:\n\ncutting on GitHub\nBlog\n@cutting on Twitter\n\n\nJulien\n\nEmail\n@J_ on Twitter\nBlog\njulienledem on GitHub\n\n\n\nLinks\n\nApache Avro\nApache Parquet\nApache Arrow\nHadoop\nApache Pig\nXerox Parc\nExcite\nNutch\nVertica\nDremel White Paper\n\nTwitter Blog on Release of Parquet\n\n\nCSV\nXML\nHive\nImpala\nPresto\nSpark SQL\nBrotli\nZStandard\nApache Drill\nTrevni\nApache Calcite\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n","content_html":"With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.
\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
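\nTo ground the row-versus-column tradeoff the episode explores, here is a brief sketch using the pyarrow library (an assumption of this sketch, not something named in the episode): it writes a small table to Parquet and reads back a single column, the access pattern columnar formats are optimized for. The file path is a placeholder.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small in-memory Arrow table (columnar in memory, like the in-memory
# sharing layer discussed in the episode).
table = pa.table({
    "user_id": [1, 2, 3],
    "action": ["view", "click", "view"],
    "duration_ms": [120, 340, 95],
})

# Parquet stores the data column by column on disk, so an analytical
# reader can pull just the columns it needs instead of whole rows.
pq.write_table(table, "events.parquet")
only_durations = pq.read_table("events.parquet", columns=["duration_ms"])
print(only_durations.to_pydict())
```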
Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that it publishes, so that it can produce new content that will continue to be well received. To surface the insights needed to grow the business, the team needs a robust data infrastructure that reliably captures all of those interactions. Walter Menendez is a data engineer on their infrastructure team, and in this episode he describes how they manage data ingestion from a wide array of sources and create an interface for their data scientists to produce valuable conclusions.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"","date_published":"2017-11-14T16:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/98f95c5e-b083-44b8-abfa-3a50913fb5fb.mp3","mime_type":"audio/mpeg","size_in_bytes":17832626,"duration_in_seconds":2620}]},{"id":"podlove-2017-08-06t09:26:59+00:00-1a858828690c96f","title":"Astronomer with Ry Walker - Episode 6","url":"https://www.dataengineeringpodcast.com/astronomer-with-ry-walker-episode-6","content_text":"Summary\n\nBuilding a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nThis is your host Tobias Macey and today I’m interviewing Ry Walker, CEO of Astronomer, the platform for data engineering.\n\n\nInterview\n\n\nIntroduction\nHow did you first get involved in the area of data management?\nWhat is Astronomer and how did it get started?\nRegulatory challenges of processing other people’s data\nWhat does your data pipelining architecture look like?\nWhat are the most challenging aspects of building a general purpose data management environment?\nWhat are some of the most significant sources of technical debt in your platform?\nCan you share some of the failures that you have encountered while architecting or building your platform and company and how you overcame them?\nThere are certain areas of the overall data engineering workflow that are well defined and have numerous tools to choose from. What are some of the unsolved problems in data management?\nWhat are some of the most interesting or unexpected uses of your platform that you are aware of?\n\n\nContact Information\n\n\nEmail\n@rywalker on Twitter\n\n\nLinks\n\n\nAstronomer\nKiss Metrics\nSegment\nMarketing tools chart\nClickstream\nHIPAA\nFERPA\nPCI\nMesos\nMesos DC/OS\nAirflow\nSSIS\nMarathon\nPrometheus\nGrafana\nTerraform\nKafka\nSpark\nELK Stack\nReact\nGraphQL\nPostGreSQL\nMongoDB\nCeph\nDruid\nAries\nVault\nAdapter Pattern\nDocker\nKinesis\nAPI Gateway\nKong\nAWS Lambda\nFlink\nRedshift\nNOAA\nInformatica\nSnapLogic\nMeteor\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. 
Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"","date_published":"2017-08-06T05:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/011800ca-593b-4203-9d3e-8d3358637e5e.mp3","mime_type":"audio/mpeg","size_in_bytes":59881944,"duration_in_seconds":2570}]},{"id":"podlove-2017-06-17t01:30:26+00:00-59d95156935a241","title":"Rebuilding Yelp's Data Pipeline with Justin Cunningham - Episode 5","url":"https://www.dataengineeringpodcast.com/episode-5-rebuilding-yelps-data-pipeline-with-justin-cunningham","content_text":"Summary\n\nYelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and modern, and then they open sourced it! In this episode Justin Cunningham joins me to discuss the decisions they made and the lessons they learned in the process, including what worked, what didn’t, and what he would do differently if he was starting over today.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nWhen you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Justin Cunningham about Yelp’s data pipeline\n\n\nInterview with Justin Cunningham\n\n\nIntroduction\nHow did you get involved in the area of data engineering?\nCan you start by giving an overview of your pipeline and the type of workload that you are optimizing for?\nWhat are some of the dead ends that you experienced while designing and implementing your pipeline?\nAs you were picking the components for your pipeline, how did you prioritize the build vs buy decisions and what are the pieces that you ended up building in-house?\nWhat are some of the failure modes that you have experienced in the various parts of your pipeline and how have you engineered around them?\nWhat are you using to automate deployment and maintenance of your various components and how do you monitor them for availability and accuracy?\nWhile you were re-architecting your monolithic application into a service oriented architecture and defining the flows of data, how were you able to make the switch while verifying that you were not introducing unintended mutations into the data being produced?\nDid you plan to open-source the work that you were doing from the start, or was that decision made after the project was completed? What were some of the challenges associated with making sure that it was properly structured to be amenable to making it public?\nWhat advice would you give to anyone who is starting a brand new project and how would that advice differ for someone who is trying to retrofit a data management architecture onto an existing project? 
\n\n\nKeep in touch\n\n\nYelp Engineering Blog\nEmail\n\n\nLinks\n\n\nKafka\nRedshift\nETL\nBusiness Intelligence\nChange Data Capture\nLinkedIn Data Bus\nApache Storm\nApache Flink\nConfluent\nApache Avro\nGame Days\nChaos Monkey\nSimian Army\nPaaSta\nApache Mesos\nMarathon\nSignalFX\nSensu\nThrift\nProtocol Buffers\nJSON Schema\nDebezium\nKafka Connect\nApache Beam\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and modern, and then they open sourced it! In this episode Justin Cunningham joins me to discuss the decisions they made and the lessons they learned in the process, including what worked, what didn’t, and what he would do differently if he was starting over today.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"","date_published":"2017-06-17T23:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/049a093d-67d9-4c9d-a0f6-405a89bb8b4a.mp3","mime_type":"audio/mpeg","size_in_bytes":70067587,"duration_in_seconds":2547}]},{"id":"podlove-2017-03-18t11:11:04+00:00-e27681bd450533b","title":"ScyllaDB with Eyal Gutkind - Episode 4","url":"https://www.dataengineeringpodcast.com/episode-4-scylladb-with-eyal-gutkind","content_text":"Summary\n\nIf you like the features of Cassandra DB but wish it ran faster with fewer resources then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded database market.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Eyal Gutkind about ScyllaDB\n\n\nInterview\n\n\nIntroduction\nHow did you get involved in the area of data management?\nWhat is ScyllaDB and why would someone choose to use it?\nHow do you ensure sufficient reliability and accuracy of the database engine?\nThe large draw of Scylla is that it is a drop in replacement of Cassandra with faster performance and no requirement to manage th JVM. What are some of the technical and architectural design choices that have enabled you to do that?\nDeployment and tuning\nWhat challenges are inroduced as a result of needing to maintain API compatibility with a diferent product?\nDo you have visibility or advance knowledge of what new interfaces are being added to the Apache Cassandra project, or are you forced to play a game of keep up?\nAre there any issues with compatibility of plugins for CassandraDB running on Scylla?\nFor someone who wants to deploy and tune Scylla, what are the steps involved?\nIs it possible to join a Scylla cluster to an existing Cassandra cluster for live data migration and zero downtime swap?\nWhat prompted the decision to form a company around the database?\nWhat are some other uses of Seastar?\n\n\nKeep in touch\n\n\nEyal\n\nLinkedIn\n\n\n\nScyllaDB\n\n\nWebsite\n@ScyllaDB on Twitter\nGitHub\nMailing List\nSlack\n\n\n\n\n\nLinks\n\n\nSeastar Project\nDataStax\nXFS\nTitanDB\nOpenTSDB\nKairosDB\nCQL\nPedis\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"If you like the features of Cassandra DB but wish it ran faster with fewer resources then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded database market.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"","date_published":"2017-03-18T07:00:00.000-04:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/731d2aec-f184-42b9-849d-fdf9e8edf0eb.mp3","mime_type":"audio/mpeg","size_in_bytes":37751819,"duration_in_seconds":2106}]},{"id":"podlove-2017-03-05t02:45:25+00:00-648c6cb8078c5d9","title":"Defining Data Engineering with Maxime Beauchemin - Episode 3","url":"https://www.dataengineeringpodcast.com/episode-3-defining-data-engineering-with-maxime-beauchemin","content_text":"Summary\n\nWhat exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.\n\nTranscript provided by CastSource\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Maxime Beauchemin\n\n\nQuestions\n\n\nIntroduction\nHow did you get involved in the field of data engineering?\nHow do you define data engineering and how has that changed in recent years?\nDo you think that the DevOps movement over the past few years has had any impact on the discipline of data engineering? If so, what kinds of cross-over have you seen?\nFor someone who wants to get started in the field of data engineering what are some of the necessary skills?\nWhat do you see as the biggest challenges facing data engineers currently?\nAt what scale does it become necessary to differentiate between someone who does data engineering vs data infrastructure and what are the differences in terms of skill set and problem domain?\nHow much analytical knowledge is necessary for a typical data engineer?\nWhat are some of the most important considerations when establishing new data sources to ensure that the resulting information is of sufficient quality?\nYou have commented on the fact that data engineering borrows a number of elements from software engineering. Where does the concept of unit testing fit in data management and what are some of the most effective patterns for implementing that practice?\nHow has the work done by data engineers and managers of data infrastructure bled back into mainstream software and systems engineering in terms of tools and best practices?\nHow do you see the role of data engineers evolving in the next few years?\n\n\nKeep In Touch\n\n\n@mistercrunch on Twitter\nmistercrunch on GitHub\nMedium\n\n\nLinks\n\n\nDatadog\nAirflow\nThe Rise of the Data Engineer\nDruid.io\nLuigi\nApache Beam\nSamza\nHive\nData Modeling\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.
\n\nTranscript provided by CastSource
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"","date_published":"2017-03-04T21:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/c9bdfeec-5f43-47c3-9535-811c2c8f953c.mp3","mime_type":"audio/mpeg","size_in_bytes":72883818,"duration_in_seconds":2720}]},{"id":"podlove-2017-01-22t16:54:47+00:00-69ad1641ced993b","title":"Dask with Matthew Rocklin - Episode 2","url":"https://www.dataengineeringpodcast.com/episode-2-dask-with-matthew-rocklin","content_text":"Summary\n\nThere is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task oriented workflow tool and an in memory processing framework, and how it brings the power of Python to bear on the problem of big data.\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Matthew Rocklin about Dask and the Blaze ecosystem.\n\n\nInterview with Matthew Rocklin\n\n\nIntroduction\nHow did you get involved in the area of data engineering?\nDask began its life as part of the Blaze project. Can you start by describing what Dask is and how it originated?\nThere are a vast number of tools in the field of data analytics. What are some of the specific use cases that Dask was built for that weren’t able to be solved by the existing options?\nOne of the compelling features of Dask is the fact that it is a Python library that allows for distributed computation at a scale that has largely been the exclusive domain of tools in the Hadoop ecosystem. Why do you think that the JVM has been the reigning platform in the data analytics space for so long?\nDo you consider Dask, along with the larger Blaze ecosystem, to be a competitor to the Hadoop ecosystem, either now or in the future?\nAre you seeing many Hadoop or Spark solutions being migrated to Dask? If so, what are the common reasons?\nThere is a strong focus for using Dask as a tool for interactive exploration of data. How does it compare to something like Apache Drill?\nFor anyone looking to integrate Dask into an existing code base that is already using NumPy or Pandas, what does that process look like?\nHow do the task graph capabilities compare to something like Airflow or Luigi?\nLooking through the documentation for the graph specification in Dask, it appears that there is the potential to introduce cycles or other bugs into a large or complex task chain. Is there any built-in tooling to check for that before submitting the graph for execution?\nWhat are some of the most interesting or unexpected projects that you have seen Dask used for?\nWhat do you perceive as being the most relevant aspects of Dask for data engineering/data infrastructure practitioners, as compared to the end users of the systems that they support?\nWhat are some of the most significant problems that you have been faced with, and which still need to be overcome in the Dask project?\nI know that the work on Dask is largely performed under the umbrella of PyData and sponsored by Continuum Analytics. 
What are your thoughts on the financial landscape for open source data analytics and distributed computation frameworks as compared to the broader world of open source projects?\n\n\nKeep in touch\n\n\n@mrocklin on Twitter\nmrocklin on GitHub\n\n\nLinks\n\n\nhttp://matthewrocklin.com/blog/work/2016/09/22/cluster-deployments\nhttps://opendatascience.com/blog/dask-for-institutions/\nContinuum Analytics\n2sigma\nX-Array\nTornado\n\nWebsite\nPodcast Interview\n\n\n\nAirflow\nLuigi\nMesos\nKubernetes\nSpark\nDryad\nYarn\nRead The Docs\nXData\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task-oriented workflow tool and an in-memory processing framework, and how it brings the power of Python to bear on the problem of big data.
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"","date_published":"2017-01-22T10:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/0df71046-f709-43e1-a39f-8447e751024a.mp3","mime_type":"audio/mpeg","size_in_bytes":32558080,"duration_in_seconds":2760}]},{"id":"podlove-2017-01-14t18:36:08+00:00-fee8f4d3c9604ae","title":"Pachyderm with Daniel Whitenack - Episode 1","url":"https://www.dataengineeringpodcast.com/epsiode-1-pachyderm-with-daniel-whitenack","content_text":"Summary\n\nDo you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. It also lets you use whatever languages you want to run your analysis with its container based task graph. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today!\n\nPreamble\n\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers\nYour host is Tobias Macey and today I’m interviewing Daniel Whitenack about Pachyderm, a modern container based system for building and analyzing a versioned data lake.\n\n\nInterview with Daniel Whitenack\n\n\nIntroduction\nHow did you get started in the data engineering space?\nWhat is pachyderm and what problem were you trying to solve when the project was started?\nWhere does the name come from?\nWhat are some of the competing projects in the space and what features does Pachyderm offer that would convince someone to choose it over the other options?\nBecause of the fact that the analysis code and the data that it acts on are all versioned together it allows for tracking the provenance of the end result. Why is this such an important capability in the context of data engineering and analytics?\nWhat does Pachyderm use for the distribution and scaling mechanism of the file system?\nGiven that you can version your data and track all of the modifications made to it in a manner that allows for traversal of those changesets, how much additional storage is necessary over and above the original capacity needed for the raw data?\nFor a typical use of Pachyderm would someone keep all of the revisions in perpetuity or are the changesets primarily just useful in the context of an analysis workflow?\nGiven that the state of the data is calculated by applying the diffs in sequence what impact does that have on processing speed and what are some of the ways of mitigating that?\nAnother compelling feature of Pachyderm is the fact that it natively supports the use of any language for interacting with your data. Why is this such an important capability and why is it more difficult with alternative solutions?\n\nHow did you implement this feature so that it would be maintainable and easy to implement for end users?\n\n\n\nGiven that the intent of using containers is for encapsulating the analysis code from experimentation through to production, it seems that there is the potential for the implementations to run into problems as they scale. 
What are some things that users should be aware of to help mitigate this?\nThe data pipeline and dependency graph tooling is a useful addition to the combination of file system and processing interface. Does that preclude any requirement for external tools such as Luigi or Airflow?\nI see that the docs mention using the map-reduce pattern for analyzing the data in Pachyderm. Does it support other approaches such as streaming or tools like Apache Drill?\nWhat are some of the most interesting deployments and uses of Pachyderm that you have seen?\nWhat are some of the areas where you are looking for help from the community, and are there any particular issues that the listeners can check out to get started with the project?\n\n\nKeep in touch\n\n\nDaniel\n\nTwitter – @dwhitena\n\n\n\nPachyderm\n\n\nWebsite\n\n\n\n\n\nFree Weekend Project\n\n\nGopherNotes\n\n\nLinks\n\n\nAirBnB\nRethinkDB\nFlocker\nInfinite Project\nGit LFS\nLuigi\nAirflow\nKafka\nKubernetes\nRkt\nSciKit Learn\nDocker\nMinikube\nGeneral Fusion\n\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA","content_html":"Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. It also lets you use whatever languages you want to run your analysis with its container-based task graph. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today!
\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
","summary":"","date_published":"2017-01-14T13:00:00.000-05:00","attachments":[{"url":"https://op3.dev/e/dts.podtrac.com/redirect.mp3/aphid.fireside.fm/d/1437767933/c6161a3f-a67b-48ef-b087-52f1f1573292/77759ae9-66c8-439b-90e8-1d9326279db6.mp3","mime_type":"audio/mpeg","size_in_bytes":42922090,"duration_in_seconds":2682}]},{"id":"podlove-2017-01-08t04:07:58+00:00-8f103a06ef5f7c5","title":"Introducing The Show","url":"https://www.dataengineeringpodcast.com/episode-0-introducing-the-show","content_text":"\nPreamble\n\nHello and welcome to the Data Engineering Podcast, the show about modern data infrastructure\nGo to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.\nYou can help support the show by checking out the Patreon page which is linked from the site.\nTo help other people find the show you can leave a review on iTunes, or Google Play Music, share it on social media, and tell your friends and co-workers.\nI’m your host, Tobias Macey, and today I’m speaking with Maxime Beauchemin about what it means to be a data engineer.\n\nInterview\n\nWho am I\nSystems administrator and software engineer, now DevOps, focus on automation\nHost of Podcast.__init__\nHow did I get involved in data management\nWhy am I starting a podcast about Data Engineering\nInteresting area with a lot of activity\nNot currently any shows focused on data engineering\nWhat kinds of topics do I want to cover\nData stores\nPipelines\nTooling\nAutomation\nMonitoring\nTesting\nBest practices\nCommon challenges\nDefining the role/job hunting\nRelationship with data engineers/data analysts\nGet in touch and subscribe\nWebsite\nNewsletter\nTwitter\nEmail\n\nThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA\n\n\n","content_html":"The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
\n