Migrate And Modify Your Data Platform Confidently With Compilerworks - Episode 213

Summary

A major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? Compilerworks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of Compilerworks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform and the system that they have built to make it a manageable task.

Databand.ai is a unified Data Observability Platform that helps DataOps teams catch and solve data health issues fast. Databand.ai’s platform helps data engineers pinpoint pipeline issues and quickly identify their root cause so DataOps can begin working on a resolution before bad data is delivered. Whether you’re using Apache Spark, Apache Airflow, Databricks, Amazon S3, self-hosted Python scripts, or combinations of these, Databand.ai allows you to monitor data health along every step of its journey. Powerful integrations with 20+ tools give you full visibility of your stack. Our mission is to help businesses trust their data with the most powerful Data Observability Platform. Experience unified observability with a free trial today: www.databand.ai


Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Census is the operational analytics platform that syncs your cloud warehouse with all the SaaS applications used by your Sales, Marketing & Success teams. If you need to get your company data into Salesforce, Marketo, Hubspot, Intercom, Zendesk, and other tools, Census is the easiest way to do so. Just write SQL (or plug in your dbt models), set up the sync frequencies, and voila, your data is now available to be used by all of your teams. No need to worry about incremental sync, backfilling, API quota management, API versioning, monitoring, and maintaining custom scripts. Just SQL. Start your free 14-day trial now.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
  • We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
  • Your host is Tobias Macey and today I’m interviewing Shevek about Compilerworks and his work on writing compilers to automate data lineage tracking from your SQL code

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Compilerworks is and the story behind it?
  • What is a compiler?
    • How are you applying compilers to the challenges of data processing systems?
  • What are some use cases that Compilerworks is uniquely well suited to?
  • There are a number of other methods and systems available for tracking and/or computing data lineage. What are the benefits of the approach that you are taking with Compilerworks?
  • Can you describe the design and implementation of the Compilerworks platform?
    • How has the system changed or evolved since you first began working on it?
  • What programming languages and SQL dialects do you currently support?
    • Which have been the most challenging to work with?
    • How do you handle verification/validation of the algebraic representation of SQL code given the variability of implementations and the flexibility of the specification?
  • Can you talk through the process of getting Compilerworks integrated into a customer’s infrastructure?
    • What is a typical workflow for someone using Compilerworks to manage their data lineage?
  • How does Compilerworks simplify the process of migrating between data warehouses/processing platforms?
  • What are the most interesting, innovative, or unexpected ways that you have seen Compilerworks used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Compilerworks?
  • When is Compilerworks the wrong choice?
  • What do you have planned for the future of Compilerworks?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
