Sponsors

The Data Engineering Podcast is supported by some great companies. Check out the sponsors that have helped to bring you the stories behind the projects that you use every day and the people who built them.

Current Sponsors

Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right data set? Or tried to understand what a column name means?

Our friends at Atlan started out as a data team themselves and faced all of this collaboration chaos firsthand, so they began building Atlan as an internal tool. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.

Go to dataengineeringpodcast.com/atlan and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription.


Unstruk Data offers an API-driven solution to simplify the process of transforming unstructured data files into actionable intelligence about real-world assets without writing a line of code – putting insights generated from this data at enterprise teams’ fingertips. The company was founded in 2021 by Kirk Marple after his tenure as CTO of Kespry. Kirk possesses extensive industry knowledge including over 25 years of experience building and architecting scalable SaaS platforms and applications, prior successful startup exits, and deep unstructured and perception data experience. Unstruk investors include 8VC, Preface Ventures, Valia Ventures, Shell Ventures and Stage Venture Partners.

Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business!


Sifflet is a Full Data Stack Observability platform acting as an overseeing layer to the Data Stack, ensuring that data is reliable from ingestion to consumption. Whether the data is in transit or at rest, Sifflet is able to detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack.

In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2,000 in platform credits when signing up to use Sifflet. We also offer a 2-week free trial.

Go to dataengineeringpodcast.com/sifflet to find out more.


Tonic.ai matches development and staging environments to production by rapidly equipping teams with high-quality data at scale. With regulations and breaches on the rise, production data is no longer safe (or legal) for developers to use, but creating test data in-house is a complex chore that eats into valuable engineering resources. With Tonic, teams no longer need to choose between productivity and security—they get both rapidly and with ease. Shorten your development cycle, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data. Through its data de-identification, advanced subsetting, and synthetic scaling technologies, Tonic makes it possible to create a true mirror of production in the safety of a developer landscape so you can work on real product and steer clear of surprises at release time.

Go to dataengineeringpodcast.com/tonic to sign up for a free 2-week sandbox account and give Tonic.ai a try!


So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data.

By analyzing your metadata, query logs, and dashboard activities, Select Star automatically documents your datasets. For every table in Select Star, you can find out where the data originated from, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use.

With Select Star’s data catalog, a single source of truth in data is built in minutes, even across thousands of datasets.

Try it out for free at dataengineeringpodcast.com/selectstar. If you’re a Data Engineering Podcast subscriber, we’ll double the length of your free trial and send you a swag package when you continue on a paid plan.


Bigeye is an industry-leading data observability platform that gives data engineering and science teams the tools they need to ensure their data is always fresh, accurate and reliable. Companies like Instacart, Clubhouse, and Udacity use Bigeye’s automated data quality monitoring, ML-powered anomaly detection, and granular root cause analysis to proactively detect and resolve issues before they impact the business.

Go to dataengineeringpodcast.com/bigeye today and start trusting your data.


Datafold helps you deal with data quality in your pull request. It provides automated regression testing throughout your schema and pipelines so you can address quality issues before they affect production. No more shipping and praying: you can now know exactly what will change in your database ahead of time.

Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.


RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.

RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.
RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.
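
To make that concrete, here is a rough sketch of what one-time SDK instrumentation can look like with RudderStack’s Python SDK. Treat it as an illustration rather than official documentation: the write key, data plane URL, and event details are placeholders, and the package and import names can differ between SDK versions.

    import rudder_analytics  # e.g. pip install rudder-sdk-python; name may vary by version

    # Placeholder credentials: use your own write key and data plane URL.
    rudder_analytics.write_key = 'YOUR-WRITE-KEY'
    rudder_analytics.data_plane_url = 'https://your-dataplane.example.com'

    # Track a single event once; RudderStack fans it out to your warehouse
    # and any of the configured downstream tools.
    rudder_analytics.track(
        'user-123',         # user id (placeholder)
        'Order Completed',  # event name (illustrative)
        {'order_id': 'ord-42', 'revenue': 99.5},
    )

    # Drain the in-memory queue before the process exits.
    rudder_analytics.flush()
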
Visit dataengineeringpodcast.com/rudder to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Shipyard is an orchestration platform that helps data teams build out solid data operations from the get-go by connecting data tools and streamlining data workflows. Shipyard offers low-code templates that are configured using a visual interface, replacing the need to write code to build workflows while enabling engineers to get their work into production faster. If a solution can’t be built with existing templates, engineers can always automate scripts in the language of their choice to bring any internal or external process into their workflows.

Observability and alerting are built into the Shipyard platform, ensuring that breakages are identified before being discovered downstream by business teams. With a high level of concurrency, scalability, and end-to-end encryption, Shipyard enables data teams to accomplish more without relying on other teams or worrying about infrastructure challenges, while also ensuring that business teams trust the data made available to them. Go to dataengineeringpodcast.com/shipyard to get started automating powerful workflows with their free developer plan today!


Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it.
Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100 million business-critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.


Ascend.io, the Data Automation Cloud, provides the most advanced automation for data and analytics engineering workloads. Ascend.io unifies the core capabilities of data engineering—data ingestion, transformation, delivery, orchestration, and observability—into a single platform so that data teams deliver 10x faster. With 95% of data teams already at or over capacity, engineering productivity is a top priority for enterprises. Ascend’s Flex-code user interface empowers any member of the data team—from data engineers to data scientists to data analysts—to quickly and easily build and deliver on the data and analytics workloads they need. And with Ascend’s DataAware™ intelligence, data teams no longer spend hours carefully orchestrating brittle data workloads and instead rely on advanced automation to optimize the entire data lifecycle. Ascend.io runs natively on data lakes and warehouses and in AWS, Google Cloud and Microsoft Azure.

Go to dataengineeringpodcast.com/ascend to find out more.



Past Sponsors

Integrating data across the enterprise has been around for decades, and so have the techniques to do it. But a new way of integrating data and improving streams has evolved. By integrating each silo independently, data can be connected without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more about how to deliver fast access to your data across the enterprise using this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin.


Enabling real-time analytics is a huge task. Without a data warehouse that outperforms the demands of your customers at a fraction of the cost and time, this big task can also prove challenging. But it doesn’t have to be tiring or difficult with ClickHouse, an open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable revenue. And Altinity is the leading ClickHouse software and service provider, on a mission to help data engineers and DevOps managers. Go to dataengineeringpodcast.com/altinity to find out how with a free consultation.


Managing data in your warehouse today is hard. Often you’ll find yourself relying on manual work and hacks to get something done. Data knowledge is fragmented across the team and the data is often unreliable. So you hire a data engineer and they spend all their time managing this custom infrastructure when what they want to be doing is focusing on writing code.

Dataform enables data teams to own the end-to-end data transformation process by giving them a collaborative web-based platform where they can develop SQL code with real-time error checking, run schedules, write data quality tests, and use a data catalog to document their data.

This enables analysts to publish tables and maintain complex SQL workflows without requiring the help of engineers, and lets data engineers focus their time on transformation code instead of having to maintain custom infrastructure.

Get a free hands-on trial with a data expert today! Sign up at dataengineeringpodcast.com/dataform, then email team@dataform.co with the subject “Data Engineering Podcast”.


Businesses are increasingly faced with the challenge of satisfying several, often conflicting, demands regarding sensitive data. From sharing and using sensitive data to complying with regulations and navigating new cloud-based platforms, Immuta helps solve these needs and more.

With automated, scalable data access and privacy controls, and enhanced collaboration between data and compliance teams, Immuta empowers data teams to easily access the data they need, when they need it – all while protecting sensitive data and ensuring their customers’ privacy. Immuta integrates with leading technology and solutions providers so you can govern your data on your desired analytic system.

Start a free trial of Immuta to see the power of automated data governance for yourself.


Many data engineers say the most frustrating part of their job is spending too much time maintaining and monitoring their data pipeline. Snowplow works with data-informed businesses to set up a real-time event data pipeline, taking care of installation, upgrades, autoscaling, and ongoing maintenance so you can focus on the data.

Snowplow runs in your own cloud account, giving you complete control and flexibility over how your data is collected and processed. Best of all, Snowplow is built on top of open source technology, which means you have visibility into every stage of your pipeline, with zero vendor lock-in.

At Snowplow, we know how important it is for data engineers to deliver high-quality data across the organization. That’s why the Snowplow pipeline is designed to deliver complete, rich and accurate data into your data warehouse of choice. Your data analysts define the data structure that works best for your teams, and we enforce it end-to-end so your data is ready to use.

Get in touch with our team to find out how Snowplow can accelerate your analytics. Go to dataengineeringpodcast.com/snowplow. Set up a demo and mention you’re a listener for a special offer!


Feature flagging is a simple concept that enables you to ship features faster, test in production, and do easy rollbacks without redeploying code. Teams using feature flags release new features with less risk, and release more often. Developers using feature flags need to merge less often.

This episode is sponsored by ConfigCat.

  • Easily use flags in your code with ConfigCat libraries for Python and 9 other platforms.
  • Toggle your feature flags visually on the dashboard.
  • Hide or expose features in your application without redeploying code.
  • Set targeting rules to control who has access to new features.

ConfigCat allows you to get features out faster, test in production, and do easy rollbacks.

With ConfigCat’s simple API and clear documentation, you’ll have your initial proof of concept up and running in minutes. You can train new team members in minutes too, and you don’t pay extra for team size. With the simple UI, the whole team can use it effectively.
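
As a minimal sketch of that initial proof of concept in Python (the SDK key and flag name below are placeholders, and depending on your SDK version the entry point may be configcatclient.create_client rather than configcatclient.get):

    import configcatclient  # pip install configcat-client

    # Create a client with your SDK key (placeholder shown here).
    client = configcatclient.get('YOUR-CONFIGCAT-SDK-KEY')

    # Evaluate a flag by name, with a safe default if it can't be fetched.
    if client.get_value('isMyNewFeatureEnabled', False):
        print('New feature is exposed to this user')
    else:
        print('Falling back to the old behavior')
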

Whether you are an individual or a team, you can try it out with their forever-free plan, or get 35% off any paid plan with code DATAENGINEERING.

Release features faster with less risk with ConfigCat. Check them out today at dataengineeringpodcast.com/configcat.


Struggling with broken pipelines? Stale dashboards? Missing data?

If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform!

Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today!

Visit dataengineeringpodcast.com/montecarlo to learn more.



Firebolt is the world’s fastest cloud data warehouse, purpose-built for high performance analytics. It provides orders of magnitude faster query performance at a fraction of the cost compared to alternatives. Companies that adopted Firebolt have been able to deploy data warehouses in weeks and deliver sub-second performance at terabyte to petabyte scale for a wide range of interactive, high performance analytics across internal BI as well as customer facing analytics use cases. Visit dataengineeringpodcast.com/firebolt to get started.


Listen, I’m sure you work for a ‘data-driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries, or are you just afraid that Amazon Redshift is gonna fall over at some point?

Well, you GOTTA talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – an amazing analytics product for data engineers to find and rewrite slow queries, with actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers.

DEP listeners get a $50 discount! Just go to dataengineeringpodcast.com/intermix and use promo code DEP at sign up.


Hightouch is the leading Reverse ETL platform. Your data warehouse is your source of truth for customer data. Hightouch syncs this data to the tools that your business teams rely on. Hightouch has a catalog of flexible destinations including Salesforce, HubSpot, Zendesk, NetSuite, and ad platforms like Facebook or Google. Hightouch is built for data engineers and is a natural extension to the modern data stack with out-of-the-box integrations with your favorite tools like dbt, Fivetran, Airflow, Slack, PagerDuty, and DataDog.

It’s simple — connect your data warehouse, paste a SQL query, and use our visual mapper to specify how data should appear in downstream tools. No scripts, just SQL. Get started for free at dataengineeringpodcast.com/hightouch.


Pipeline Data Engineering Academy: Learn Data Craftsmanship Beyond the AI Hype

A cohort-based online course where you’ll learn the fundamentals of building sustainable data infrastructures that power data products, business intelligence and machine learning systems. We’re also the world’s first data engineering bootcamp, led by industry experts.

Experience collaboration and pragmatism within the data world, engage in real-life engineering problems and solve them while keeping an eye on sustainability factors across the board.

Expect to get your hands dirty and learn how to solve real challenges through best practices. Join other data enthusiasts with diverse backgrounds to experience the software systems that power the most innovative tech products and digital platforms of tomorrow.

Take part in career coaching sessions, expert AMAs and communication training and you’ll get access to a meaningful network of professionals and organisations in the data ecosystem.
Learn more at dataengineeringpodcast.com/academy

PostHog is an open source product analytics platform. PostHog enables software teams to understand user behavior – auto-capturing events, performing product analytics and dashboarding, enabling video replays, and rolling out new features behind feature flags, all based on their single open source platform. The product’s open source approach enables companies to self-host, removing the need to send data externally. Try it out today at dataengineeringpodcast.com/posthog.
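
For server-side events, capturing looks roughly like the sketch below with PostHog’s Python library. Treat it as an illustration rather than a verbatim quickstart: the project API key, host, and event names are placeholders (the host line is what pointing the client at a self-hosted instance looks like).

    import posthog  # pip install posthog

    posthog.project_api_key = 'YOUR-PROJECT-API-KEY'   # placeholder
    posthog.host = 'https://posthog.your-company.com'  # self-hosted instance (placeholder)

    # Capture an event for a user; properties are arbitrary key/value pairs.
    posthog.capture('user-123', 'signup_completed', {'plan': 'free'})
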


What happens when your expanding log and event data threatens to topple your Elasticsearch strategy? Whether you’re running your own ELK Stack or leveraging an Elasticsearch-based service, unexpected costs and data retention limits quickly mount. Now try CHAOSSEARCH. Run your entire logging infrastructure on your AWS S3. Never move your data. Fully managed service. Half the cost of Elasticsearch. Check out the short video overview of CHAOSSEARCH today! Forget Elasticsearch! Try CHAOSSEARCH: search analytics on your AWS S3.


Have you ever found yourself lost in a pile of directories, each only differing by a cryptic and poorly considered version number, wishing that you could just dump it all into your source control system to track changes and change history? Lucky for you the fine folks at Quilt Data were in the same boat and decided to build something just for you! Quilt is an open source platform for managing your data sets in the same way that you manage your software. It includes metadata management, version history, and distributed delivery so that you can build a workflow that works for your whole team.

Stop by their booth at JupyterCon in New York City on August 22nd through the 24th to say Hi and tell them that the Data Engineering Podcast sent you! After that, keep an eye on the AWS marketplace for a pre-packaged version of Quilt for Teams to deploy into your own environment and stop fighting with your data.


Tidy Data is a monitoring platform for your data pipelines. Custom in-house solutions are costly, laborious, and fragile; replacing them with Tidy Data’s consistent managed data ops platform solves these issues. Monitor your data pipeline like you monitor your website. It’s like Pingdom for data. No credit card required to sign up. Go to dataengineeringpodcast.com/tidydata today to get started with their free tier.



strongDM enables you to easily manage and audit access to databases and servers. Leading organizations including Hearst, SoFi, and Peloton rely on strongDM to eliminate the manual-heavy work required to onboard, offboard, and audit staff’s access to everything. Simplify your access control strategy today and schedule a demo to see how much easier your life can be.



Scrape Your Web Data From Any Target
With Oxylabs Scraper APIs, extract public data from even the most complex targets. The Scraper APIs handle JavaScript-heavy websites and support large data requests, backed by a global infrastructure of over 102 million proxies. Receive data in JSON or CSV format and pay only per successful request. Get a free trial today.

Databand.ai is a unified Data Observability Platform that helps DataOps teams catch and solve data health issues fast. Databand.ai’s platform helps data engineers pinpoint pipeline issues and quickly identify their root cause so DataOps can begin working on a resolution before bad data is delivered. Whether you’re using Apache Spark, Apache Airflow, Databricks, Amazon S3, self-hosted Python scripts, or combinations of these, Databand.ai allows you to monitor data health along every step of its journey. Powerful integrations with 20+ tools give you full visibility into your stack. Our mission is to help businesses trust their data with the most powerful Data Observability Platform. Experience unified observability with a free trial today: www.databand.ai


Segment provides the reliable data infrastructure companies need to easily collect, clean, and control their customer data. Once you try it, you’ll understand why Segment is one of the hottest companies coming out of Silicon Valley. Segment recently launched a Startup Program so that early-stage startups can get a Segment account totally free up to $25k, plus exclusive deals from some favorite vendors and other resources to become data experts. Go to dataengineeringpodcast.com/segmentio today and see if you or a startup you know qualify for the program.

Machine learning is finding its way into every aspect of software engineering, making understanding it critical to future success. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype.

Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.



Kyligence was founded in 2016 by the original creators of Apache Kylin™, the leading open source OLAP engine for Big Data. Kyligence offers an Intelligent OLAP Platform to simplify multidimensional analytics for cloud data lakes. Its AI-augmented engine detects patterns in the most frequently asked business queries, builds governed data marts automatically, and brings metrics accountability to the data lake to optimize data pipelines and avoid an excessive number of tables. It provides a unified SQL interface between the cloud object store, cubes, indexes, and underlying data sources, with a cost-based smart query router for business intelligence, ad-hoc analytics, and data services at PB scale.

Kyligence is trusted by global leaders in the financial services, manufacturing, and retail industries, including UBS, China Construction Bank, China Merchants Bank, Pingan Bank, MetLife, Costa, and Appzen. Through technology partnerships with Microsoft, Amazon, Tableau, and Huawei, Kyligence is on a mission to simplify and govern data lakes to make them productive for critical business analytics and data services.

Kyligence is dual headquartered in San Jose, CA, United States and Shanghai, China, and is backed by leading investors including Redpoint Ventures, Cisco, Broadband Capital, Shunwei Capital, Eight Roads Ventures, Coatue Management, SPDB International, CICC, Gopher Assets, Guofang Capital, ASG, Jumbo Sheen Fund, and Puxin Capital.

Go to dataengineeringpodcast.com/kyligence today to find out more.


GoodData is revolutionizing the way in which companies provide analytics to their customers and partners.

With data-driven applications now becoming the new norm, GoodData allows you to easily provide tailored scalable data access to multiple companies, groups, and users. Ready to see how you can get started? Start now with GoodData Free, our product offering that makes our self-service analytics platform available to you at no cost. When you sign up for GoodData Free, you get five workspaces for an unlimited number of users. You can continue to use GoodData Free for as long as you like, and our support team is available for whatever you need. If at any point you’d like to take your analytics to the next level, our team can guide you through the process of transitioning to our Growth or Enterprise tiers.


Join us at the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View, hosted by Alluxio! This one-day community conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, and Tensorflow, and you will also hear from the creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! Use discount code PODCAST for 25% off tickets. Admission also includes a free training session on getting started with Presto and Alluxio in AWS, run by the creators of Presto and Alluxio. Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro!

Data Engineering Podcast listeners get 25% off with discount code PODCAST. Register here!


About Mode
Mode is the only data platform built by data experts for data experts. With Mode, analysts and data scientists work how they want to, with a powerful end-to-end workflow that covers everything from exploration stages to final, shareable product. They get the flexibility to work with raw or modeled data without moving between different programs, and Mode’s robust collaboration tools make it easy to work with other data experts on their team. As a result, they can mine for more opportunities, diagnose bigger business problems, predict outcomes, and make recommendations for the future faster than ever before.

Check out the data analysis platform that Lyft trusts at dataengineeringpodcast.com/mode-lyft



TimescaleDB is the leading open-source relational database with support for time-series data: data that is timestamped so you can measure how a system changes over time. TimescaleDB brings you the familiarity of PostgreSQL with the speed and petabyte scale required to handle such relentless data.
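
Because TimescaleDB is PostgreSQL underneath, you can work with it from any Postgres driver. A minimal sketch in Python, assuming placeholder connection details and an invented table (create_hypertable is TimescaleDB’s function for partitioning a table on its time column):

    import psycopg2  # pip install psycopg2-binary

    # Connection details are placeholders for a local TimescaleDB instance.
    conn = psycopg2.connect("dbname=metrics user=postgres host=localhost")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS conditions (
                time        TIMESTAMPTZ      NOT NULL,
                device_id   TEXT             NOT NULL,
                temperature DOUBLE PRECISION
            );
        """)
        # Turn the plain table into a hypertable partitioned on time.
        cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")
    conn.close()
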

Understand the past, monitor the present, and predict the future. Get started today at dataengineeringpodcast.com/timescale.


DataKitchen offers the first end-to-end DataOps Platform that empowers teams to reclaim control of their data pipelines and deliver business value instantly, without errors. The platform automates and coordinates all the people, tools, and environments in your entire data analytics organization – everything from orchestration, testing and monitoring to development and deployment. It’s DataOps Delivered.

Read The DataOps Cookbook or contact us to learn more at DataKitchen.io.


Census is the operational analytics platform that syncs your cloud warehouse with all the SaaS applications used by your Sales, Marketing & Success teams. If you need to get your company data into Salesforce, Marketo, Hubspot, Intercom, Zendesk, and other tools, Census is the easiest way to do so. Just write SQL (or plug in your dbt models), set up the sync frequencies, and voila, your data is now available to be used by all of your teams. No need to worry about incremental sync, backfilling, API quota management, API versioning, monitoring, and maintaining custom scripts. Just SQL. Start your free 14-day trial now.


Datadog is a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog delivers complete visibility into the performance of modern applications in one place through its fully unified platform—which improves cross-team collaboration, accelerates development cycles, and reduces operational and development costs.

Try Datadog in your environment with a free 14-day trial—and get a complimentary t-shirt if you install the agent. Go to datadog.com/dataengineeringpodcast to get started!

Molecula allows businesses to operationalize AI projects through a novel data format and purpose-built feature storage system. Molecula’s technology automates the extraction of features from raw data at the source, enabling unified, instant access to massive quantities of big data in a highly-performant format. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data.

If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.


Your data scientist finished a new machine learning model, so they send you their Python script and wish you good luck. Now you have to figure out where to put it and plead with DevOps to deploy it. Not to mention writing the API to consume the model’s results.

Wouldn’t it make your job easier if the data science team could build, train, deploy, and monitor their models independently? Metis Machine agrees.

Meet Skafos, the machine learning platform that enables teams of data scientists to drastically shorten time to market by providing tools and workflows that are familiar and easy. Serverless ML production deployment is as simple as “git push”. Skafos orchestrates your jobs seamlessly, guaranteeing they will run.

Skafos handles the tedious and time-consuming work of applying Machine Learning at scale so you can focus on what you do best.

The team here at Metis Machine shipped a proof-of-concept integration between our powerful machine learning platform, Skafos, and the business intelligence software Tableau. BI teams can now invoke custom machine learning models built by in-house data science teams.

Does that sound awesome? It is.

Join Metis Machine’s free webinar to walk through the architecture of this extension, see its capabilities demonstrated in real time, and hear a use case for empowering your BI team to modify machine learning models independently and immediately see the results, right from Tableau. You have to see it to believe it. So join us on October 11th at 2 PM ET (11 AM PT) and see what Skafos + Tableau can do.

To register, go to metismachine.com/webinars


Hate data conferences that are swarming with salespeople? So do we! That’s why we created a better one. Data Council helps technical professionals stay abreast of the latest advancements in data engineering, science, and machine intelligence. This April we will host 6 unique tracks and 50+ speakers over 2 full days of deeply technical learning and fun. We are offering a $200 discount to listeners of the Data Engineering Podcast. Use code DEP-200 at checkout.



Datacoral is this week’s Data Engineering Podcast sponsor. Datacoral provides an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to construct the infrastructure themselves. Datacoral’s customers report that their data engineers are able to spend 80% of their work time on data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal of making SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral for more information.


This episode of the Data Engineering Podcast is brought to you by Clubhouse, the first project management platform for software development that brings everyone together so that teams can focus on what matters – creating products their customers love. Clubhouse provides the perfect balance of simplicity and structure for better cross-functional collaboration. Its fast, intuitive interface makes it easy for people on any team to focus in on their work on a specific task or project, while also being able to “zoom out” to see how that work is contributing towards the bigger picture. With a simple API and a robust set of integrations, Clubhouse also seamlessly integrates with the tools you use every day, getting out of your way so that you can deliver quality software on time.

Listeners of the Data Engineering Podcast can sign up for two free months of Clubhouse by visiting dataengineeringpodcast.com/clubhouse.


Today’s episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all data users can use software engineering best practices – git, tests, and continuous deployment – with a simple-to-use visual designer.

How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage.
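
To give a feel for the kind of output described, here is a generic PySpark job of the shape a visual pipeline might compile down to. This is an illustrative sketch, not Prophecy’s actual generated code; the paths and column names are invented.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("daily_revenue_pipeline").getOrCreate()

    # Source: read raw orders (hypothetical path).
    orders = spark.read.parquet("s3://example-bucket/raw/orders/")

    # Transform: keep completed orders and derive an order date.
    completed = (
        orders
        .filter(F.col("status") == "completed")
        .withColumn("order_date", F.to_date("created_at"))
    )

    # Aggregate: total revenue per day.
    daily_revenue = (
        completed
        .groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"))
    )

    # Target: write the mart table (hypothetical path).
    daily_revenue.write.mode("overwrite").parquet("s3://example-bucket/marts/daily_revenue/")
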

Finally, if you have existing workflows in AbInitio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark.

Sign up for a free account today at dataengineeringpodcast.com/prophecy


Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took, compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.


Satori created the first DataSecOps solution which streamlines data access while solving the most difficult security and privacy challenges. The Secure Data Access Service is a universal visibility and control plane across all data stores, allowing you to oversee your data and its usage in real-time while automating access controls. The service maps all of the organization’s sensitive data and monitors all data flows in real-time across all data stores. Satori enables your organization to replace cumbersome permissions with streamlined just-in-time data access workflows. It acts as a universal policy engine for data access by enforcing access policies, masking or anonymizing data, and initiating off-band access workflows.

Satori integrates into your environment in minutes by simply replacing the data store URL. Since Satori’s solution is transparent, there is no need to change your existing data flow or data store configuration. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
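
To illustrate what “replacing the data store URL” means in practice, here is a sketch using a plain Postgres client; the hostnames and credentials below are entirely hypothetical.

    import psycopg2  # pip install psycopg2-binary

    # Before Satori, the client connects straight to the data store, e.g.:
    #   conn = psycopg2.connect(host="analytics-db.internal.example.com", ...)
    #
    # After Satori: same driver and credentials; only the host changes to
    # the Satori-issued endpoint (hypothetical name below).
    conn = psycopg2.connect(
        host="acme.postgres.satori.example.net",
        dbname="analytics",
        user="analyst",
        password="********",
    )

    with conn.cursor() as cur:
        cur.execute("SELECT 1")  # queries now flow through Satori's access layer
        print(cur.fetchone())
    conn.close()
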

Equalum’s end-to-end data ingestion platform is relied upon by enterprises across industries to seamlessly stream data to operational, real-time analytics and machine learning environments. Equalum combines streaming Change Data Capture, bulk replication of large, complex data sets, full management, monitoring, and alerting with validation, synchronization between sources and targets, as well as modern data transformation capabilities, all using the platform’s no-code UI. While real-time ingestion and integration are a core strength of the platform, Equalum also supports high-scale batch processing.

Equalum also supports both structured and semi-structured data formats, and can run on-premises, in public clouds or in hybrid environments. Equalum’s library of optimized and developed CDC connectors is one of the largest in the world, and more are developed and rolled out on a continuous basis, largely based on customer demand. Equalum’s multi-modal approach to data ingestion can power a multitude of use cases including CDC Data Replication, CDC ETL ingestion, batch ingestion and more.

Equalum also leverages open source data frameworks by orchestrating Apache Spark, Kafka, and others under the hood. The platform’s easy-to-use, drag-and-drop UI eliminates IT productivity bottlenecks with rapid deployment and simple data pipeline setup. The platform’s comprehensive data monitoring eliminates the need for endless DIY patch fixes to broken pipelines and challenging open source framework management, empowering the user with immediate system diagnostics, solution options, and visibility into data integrity.

Develop and operationalize your batch and streaming pipelines with infinite scalability and speed without the legacy platform price tag. Go to dataengineeringpodcast.com/equalum today to start a free 2-week test run of their platform, and don’t forget to tell them that we sent you.


At StreamSets, our mission is to make data engineering teams wildly successful. The StreamSets DataOps Platform is a modern, end-to-end data integration platform to build, run, monitor and manage data pipelines, and embraces the DataOps philosophy of continuous data. Only StreamSets provides a single design experience for all design patterns to enable 10x greater developer productivity; smart data pipelines that are resilient to change to reduce breakages by 80%; and a single pane of glass for managing and monitoring all pipelines across hybrid and cloud architectures to eliminate blind spots and control gaps. With StreamSets, you can deliver continuous data for modern analytics, despite constant change.

Visit dataengineeringpodcast.com/streamsets to learn more and try it for free. The first 10 listeners of the podcast who subscribe to StreamSets’ Professional Tier receive 2 months free after their first month.


The modern data stack needs a reimagined metadata management platform. Acryl Data’s vision is to bring clarity to your data through its next-generation multi-cloud metadata management platform. Founded by the leaders that created projects like LinkedIn DataHub and Airbnb Dataportal, Acryl Data enables delightful search and discovery, data observability, and federated governance across data ecosystems. Sign up for the SaaS product today at dataengineeringpodcast.com/acryl


Tree Schema is a data catalog that makes metadata management accessible to everyone – with essential capabilities including data lineage, automated schema inference, rich-text documentation, asset tagging and more. Built from the ground up with a focus on the intersection of people and data, Tree Schema helps every member on your team collaborate on how to access, understand and use your data.

With the most transparent pricing in the industry – $99/mo for your entire company – and a money-back guarantee for excellent service, you’ll love Tree Schema as much as you love your data. Go to dataengineeringpodcast.com/treeschema today to get your first month free, and mention this podcast to get 50% off your first three months after the trial, a $150 value.



Alluxio provides an open source unified data orchestration layer for hybrid and multi-cloud environments, making data accessible wherever data computation and processing is done. By seamlessly pulling data from underlying data silos, Alluxio unlocks the value of data and allows for modern data-intensive workloads to become truly elastic and flexible for the cloud.

If you’re looking for a way to optimize your data engineering pipeline – with instant query performance – look no further than Qubz. Qubz is next-generation OLAP technology built for the scale of Big Data, from UST Global, a renowned digital services provider. Qubz lets users and enterprises analyze data on the cloud and on-premises with blazing speed, while eliminating the complex engineering required to operationalize analytics at scale. With an emphasis on visual data engineering and connectors for all major BI tools and data sources, Qubz allows users to query OLAP cubes with sub-second response times on hundreds of billions of rows. To learn more, and to sign up for a free demo for listeners of the show, visit dataengineeringpodcast.com/qubz and be sure to let them know that you heard about them on the Data Engineering Podcast.