The Data Engineering Podcast is supported by some great companies. Check out the sponsors that have helped to bring you the stories behind the projects that you use every day and the people who built them.
Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means?
Our friends at Atlan started out as a data team, faced all this collaboration chaos firsthand, and began building Atlan as an internal tool. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more.
Go to dataengineeringpodcast.com/atlan and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
Hightouch is the leading Reverse ETL platform. Your data warehouse is your source of truth for customer data. Hightouch syncs this data to the tools that your business teams rely on. Hightouch has a catalog of flexible destinations including Salesforce, HubSpot, Zendesk, NetSuite, and ad platforms like Facebook or Google. Hightouch is built for data engineers and is a natural extension to the modern data stack with out-of-the-box integrations with your favorite tools like dbt, Fivetran, Airflow, Slack, PagerDuty, and DataDog.
It’s simple — connect your data warehouse, paste a SQL query, and use our visual mapper to specify how data should appear in downstream tools. No scripts, just SQL. Get started for free at dataengineeringpodcast.com/hightouch
Databand.ai is a unified Data Observability Platform that helps DataOps teams catch and solve data health issues fast. Databand.ai’s platform helps data engineers pinpoint pipeline issues and quickly identify their root cause so DataOps can begin working on a resolution before bad data is delivered. Whether you’re using Apache Spark, Apache Airflow, Databricks, Amazon S3, self-hosted Python scripts, or combinations of these, Databand.ai allows you to monitor data health along every step of its journey. Powerful integrations with 20+ tools give you full visibility of your stack. Our mission is to help businesses trust their data with the most powerful Data Observability Platform. Experience unified observability with a free trial today: www.databand.ai
Satori created the first DataSecOps solution which streamlines data access while solving the most difficult security and privacy challenges. The Secure Data Access Service is a universal visibility and control plane across all data stores, allowing you to oversee your data and its usage in real-time while automating access controls. The service maps all of the organization’s sensitive data and monitors all data flows in real-time across all data stores. Satori enables your organization to replace cumbersome permissions with streamlined just-in-time data access workflows. It acts as a universal policy engine for data access by enforcing access policies, masking or anonymizing data, and initiating off-band access workflows.
Satori integrates into your environment in minutes by simply replacing the data store URL. Since Satori’s solution is transparent, there is no need to change your existing data flow or data store configuration. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Census is the operational analytics platform that syncs your cloud warehouse with all the SaaS applications used by your Sales, Marketing & Success teams. If you need to get your company data into Salesforce, Marketo, Hubspot, Intercom, Zendesk, and other tools, Census is the easiest way to do so. Just write SQL (or plug in your dbt models), set up the sync frequencies, and voila, your data is now available to be used by all of your teams. No need to worry about incremental sync, backfilling, API quota management, API versioning, monitoring, and maintaining custom scripts. Just SQL. Start your free 14-day trial now.
RudderStack is the smart customer data pipeline. It takes the toil out of building data pipelines that connect your whole customer data stack. Its easy-to-use SDKs and source integrations, Cloud Extract integrations, transformations, and expansive library of destination and warehouse integrations make building customer data pipelines for both event streaming and cloud-to-warehouse ELT simple. RudderStack’s warehouse-first approach and Warehouse Actions functionality make your customer data stack smarter by enabling analysis and modeling in your data warehouse to trigger enrichment and activation in all of your customer tools. Start building smarter customer data pipelines today with RudderStack. Visit dataengineeringpodcast.com/rudder to learn more and sign up for our no credit card required, no time limit free tier.
Firebolt is the world’s fastest cloud data warehouse, purpose-built for high performance analytics. It provides orders of magnitude faster query performance at a fraction of the cost compared to alternatives. Companies that adopted Firebolt have been able to deploy data warehouses in weeks and deliver sub-second performance at terabyte to petabyte scale for a wide range of interactive, high performance analytics across internal BI as well as customer facing analytics use cases. Visit dataengineeringpodcast.com/firebolt to get started.
Managing data in your warehouse today is hard. Often you’ll find yourself relying on manual work and hacks to get something done. Data knowledge is fragmented across the team and the data is often unreliable. So you hire a data engineer and they spend all their time managing this custom infrastructure when what they want to be doing is focusing on writing code.
Dataform enables data teams to own the end to end data transformation process by giving them a collaborative web based platform where they can develop SQL code with real time error checking, run schedules, write data quality tests and use a data catalog to document.
This enables analysts to publish tables and maintain complex SQL workflows without requiring the help of engineers, and lets data engineers focus their time on transformation code instead of maintaining custom infrastructure.
Data integration across the enterprise has been around for decades, and so have the techniques for doing it. But a new way of integrating data and improving streams has evolved. By integrating each silo independently, data can be integrated without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more about how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin.
Machine learning is finding its way into every aspect of software engineering, making understanding it critical to future success. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype.
Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. Applying only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
Datafold gives you visibility and confidence in the quality of your analytical data with fast dataset diffing, profiling, column-level lineage, and intelligent anomaly detection. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Go to dataengineeringpodcast.com/datafold to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days?? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point??
Well, you GOTTA talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers.
DEP listeners get a $50 discount! Just go to dataengineeringpodcast.com/intermix and use promo code DEP at sign up.
You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines.
Monte Carlo’s Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess their impact through lineage, and notify those who need to know. By automatically and immediately identifying the root cause of an issue, teams can easily collaborate and resolve problems faster. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit www.dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 who use the promo code “PODCAST” will receive a free, limited edition Monte Carlo hat!
Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.
With data-driven applications now becoming the new norm, GoodData allows you to easily provide tailored scalable data access to multiple companies, groups, and users. Ready to see how you can get started? Start now with GoodData Free, our product offering that makes our self-service analytics platform available to you at no cost. When you sign up for GoodData Free, you get five workspaces for an unlimited number of users. You can continue to use GoodData Free for as long as you like, and our support team is available for whatever you need. If at any point you’d like to take your analytics to the next level, our team can guide you through the process of transitioning to our Growth or Enterprise tiers.
Ascend.io, the data engineering company, provides the flex-code data platform for autonomous pipelines that frees data teams to spend more time innovating. Data pipelines are the backbone of modern data systems. However, data engineers are overburdened with building and maintaining brittle pipelines, which creates a backlog that prevents data analysts and data scientists from accessing critical information. The Ascend Unified Data Engineering Platform removes these bottlenecks and enables teams to create self-service data pipelines that dynamically adapt to changes in data, code, and environment.
Molecula allows businesses to operationalize AI projects through a novel data format and purpose-built feature storage system. Molecula’s technology automates the extraction of features from raw data at the source, enabling unified, instant access to massive quantities of big data in a highly-performant format. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data.
If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.
Join us at the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View hosted by Alluxio! This one day community conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, and TensorFlow, and will also hear from creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! Use discount code PODCAST for 25% off tickets. Admission also includes a free training session on getting started with Presto and Alluxio in AWS run by the creators of Presto and Alluxio. Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro!
This episode of the Data Engineering Podcast is brought to you by Clubhouse, the first project management platform for software development that brings everyone together so that teams can focus on what matters – creating products their customers love. Clubhouse provides the perfect balance of simplicity and structure for better cross-functional collaboration. Its fast, intuitive interface makes it easy for people on any team to focus in on a specific task or project, while also being able to “zoom out” to see how that work is contributing towards the bigger picture. With a simple API and robust set of integrations, Clubhouse also seamlessly integrates with the tools you use every day, getting out of your way so that you can deliver quality software on time.
Listeners of the Data Engineering Podcast can sign up for two free months of Clubhouse by visiting dataengineeringpodcast.com/clubhouse.
Mode is the only data platform built by data experts for data experts. With Mode, analysts and data scientists work how they want to, with a powerful end-to-end workflow that covers everything from exploration stages to final, shareable product. They get the flexibility to work with raw or modeled data without moving between different programs, and Mode’s robust collaboration tools make it easy to work with other data experts on their team. As a result, they can mine for more opportunities, diagnose bigger business problems, predict outcomes, and make recommendations for the future faster than ever before.
Check out the data analysis platform that Lyft trusts at dataengineeringpodcast.com/mode-lyft
Hate data conferences that are swarming with salespeople? So do we! That’s why we created a better one. Data Council helps technical professionals stay abreast of the latest advancements in data engineering, science & machine intelligence. This April we will host 6 unique tracks and 50+ speakers over 2 full days of deeply technical learning and fun. We are offering a $200 discount to listeners of the Data Engineering Podcast. Use code DEP-200 at checkout.
Your Data Scientist finished a new Machine Learning model, so he sends you his Python script and wishes you good luck. Now you have to figure out where to put it and plead with DevOps to deploy it, not to mention write the API to consume the model’s results.
Wouldn’t it make your job easier if the Data Science team could build, train, deploy and monitor their models independently? Metis Machine agrees.
Meet Skafos, the machine learning platform that enables teams of data scientists to drastically speed up the time to market by providing tools and workflows that are familiar and easy. Serverless ML production deployment is as simple as “git push”. Skafos orchestrates your jobs seamlessly, guaranteeing they will run.
Skafos handles the tedious and time-consuming work of applying Machine Learning at scale so you can focus on what you do best.
The team here at Metis Machine shipped a proof-of-concept integration between our powerful machine learning platform Skafos, and the business intelligence software Tableau. BI teams can now invoke custom-built machine learning models built by in-house science teams.
Does that sound awesome? It is.
Join Metis Machine’s free webinar to walk through the architecture of this extension, demonstrate its capabilities in real time, and lay out a use case for empowering your BI team to modify machine learning models independently and immediately see the results, right from Tableau. You have to see it to believe it. So join us on October 11th at 2 PM ET (11 AM PT) and see what Skafos + Tableau can do.
To register, go to metismachine.com/webinars
Tidy Data is a platform for monitoring your data pipeline. Custom in-house solutions are costly, laborious, and fragile. Replacing them with Tidy Data’s consistent managed data ops platform will solve these issues. Monitor your data pipeline like you monitor your website. It’s like Pingdom for data. No credit card required to sign up. Go to dataengineeringpodcast.com/tidydata today to get started with their free tier.
Tree Schema is a data catalog that makes metadata management accessible to everyone – with essential capabilities including data lineage, automated schema inference, rich-text documentation, asset tagging and more. Built from the ground up with a focus on the intersection of people and data, Tree Schema helps every member on your team collaborate on how to access, understand and use your data.
With the most transparent pricing in the industry – $99/mo for your entire company – and a money-back guarantee for excellent service, you’ll love Tree Schema as much as you love your data. Go to dataengineeringpodcast.com/treeschema today to get your first month free, and mention this podcast to get 50% off your first three months after the trial, a $150 value.
Datacoral is this week’s Data Engineering Podcast sponsor. Datacoral provides an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to construct the underlying infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time on data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal of making SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral for more information.
If you’re looking for a way to optimize your data engineering pipeline – with instant query performance – look no further than Qubz. Qubz is next-generation OLAP technology built for the scale of Big Data from UST Global, a renowned digital services provider. Qubz lets users and enterprises analyze data on the cloud and on-premise, with blazing speed, while eliminating the complex engineering required to operationalize analytics at scale. With an emphasis on visual data engineering, and connectors for all major BI tools and data sources, Qubz allows users to query OLAP cubes with sub-second response times on hundreds of billions of rows. To learn more, and sign up for a free demo for listeners of the show, visit dataengineeringpodcast.com/qubz and be sure to let them know that you heard about them on the Data Engineering Podcast.
Have you ever found yourself lost in a pile of directories, each only differing by a cryptic and poorly considered version number, wishing that you could just dump it all into your source control system to track changes and change history? Lucky for you the fine folks at Quilt Data were in the same boat and decided to build something just for you! Quilt is an open source platform for managing your data sets in the same way that you manage your software. It includes metadata management, version history, and distributed delivery so that you can build a workflow that works for your whole team.
Stop by their booth at JupyterCon in New York City on August 22nd through the 24th to say Hi and tell them that the Data Engineering Podcast sent you! After that, keep an eye on the AWS marketplace for a pre-packaged version of Quilt for Teams to deploy into your own environment and stop fighting with your data.
Feature flagging is a simple concept that enables you to ship features faster, test in production, and do easy rollbacks without redeploying code. Teams using feature flags release new features with less risk, and release more often. Developers using feature flags need to merge less.
This episode is sponsored by ConfigCat.
- Easily use flags in your code with ConfigCat libraries for Python and 9 other platforms.
- Toggle your feature flags on the visual dashboard.
- Hide or expose features in your application without redeploying code.
- Set targeting rules to allow you to control who has access to new features.
ConfigCat allows you to get features out faster, test in production, and do easy rollbacks.
With ConfigCat’s simple API and clear documentation, you’ll have your initial proof of concept up and running in minutes. You can train new team members in minutes too, and you don’t pay extra for team size. With the simple UI, the whole team can use it effectively.
Whether you are an individual or a team, you can try it out with their forever free plan. Or get 35% off any paid plan with code DATAENGINEERING.
Release features faster with less risk with ConfigCat. Check them out today at dataengineeringpodcast.com/configcat
Alluxio provides an open source unified data orchestration layer for hybrid and multi-cloud environments, making data accessible wherever data computation and processing is done. By seamlessly pulling data from underlying data silos, Alluxio unlocks the value of data and allows for modern data-intensive workloads to become truly elastic and flexible for the cloud.
Want a free Alluxio t-shirt? Sign up below and we’ll send one to you!
Equalum’s end to end data ingestion platform is relied upon by enterprises across industries to seamlessly stream data to operational, real-time analytics and machine learning environments. Equalum combines streaming Change Data Capture, bulk replication of large, complex data sets, full management, monitoring, and alerting with validation, synchronization between sources and targets as well as modern data transformation capabilities all using the platform’s no-code UI. While real-time ingestion and integration are a core strength of the platform, Equalum also supports high scale batch processing.
Equalum also supports both structured and semi-structured data formats, and can run on-premises, in public clouds or in hybrid environments. Equalum’s library of optimized and developed CDC connectors is one of the largest in the world, and more are developed and rolled out on a continuous basis, largely based on customer demand. Equalum’s multi-modal approach to data ingestion can power a multitude of use cases including CDC Data Replication, CDC ETL ingestion, batch ingestion and more.
Equalum also leverages open source data frameworks by orchestrating Apache Spark, Kafka and others under the hood. The platform’s easy to use, drag and drop UI eliminates IT productivity bottlenecks with rapid deployment and simple data pipeline setup. The platform’s comprehensive data monitoring eliminates the need for endless DIY patch fixes to broken pipelines and challenging open source frameworks management, empowering the user with immediate system diagnostics, solution options and visibility into data integrity.
Develop and operationalize your batch and streaming pipelines with infinite scalability and speed without the legacy platform price tag. Go to dataengineeringpodcast.com/equalum today
Businesses are increasingly faced with the challenge of satisfying several, often conflicting, demands regarding sensitive data. From sharing and using sensitive data to complying with regulations and navigating new cloud-based platforms, Immuta helps solve these needs and more.
With automated, scalable data access and privacy controls, and enhanced collaboration between data and compliance teams, Immuta empowers data teams to easily access the data they need, when they need it – all while protecting sensitive data and ensuring their customers’ privacy. Immuta integrates with leading technology and solutions providers so you can govern your data on your desired analytic system.
Start a free trial of Immuta to see the power of automated data governance for yourself.
Datadog is a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog delivers complete visibility into the performance of modern applications in one place through its fully unified platform—which improves cross-team collaboration, accelerates development cycles, and reduces operational and development costs.
What happens when your expanding log & event data threatens to topple your Elasticsearch strategy? Whether you’re running your own ELK Stack or leveraging an Elasticsearch-based service, unexpected costs and data retention limits quickly mount. Now try CHAOSSEARCH. Run your entire logging infrastructure on your AWS S3. Never move your data. Fully managed service. Half the cost of Elasticsearch. Check out this short video overview of CHAOSSEARCH today! Forget Elasticsearch! Try CHAOSSEARCH: search analytics on your AWS S3.
Enabling real-time analytics is a huge task. Without a data warehouse that outperforms the demands of your customers at a fraction of the cost and time, this big task can also prove challenging. But it doesn’t have to be tiring or difficult with ClickHouse, an open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable revenue. Altinity is the leading ClickHouse software and service provider, on a mission to help data engineers and DevOps managers. Go to dataengineeringpodcast.com/altinity to find out how with a free consultation.
DataKitchen offers the first end-to-end DataOps Platform that empowers teams to reclaim control of their data pipelines and deliver business value instantly, without errors. The platform automates and coordinates all the people, tools, and environments in your entire data analytics organization – everything from orchestration, testing and monitoring to development and deployment. It’s DataOps Delivered.
strongDM enables you to easily manage and audit access to databases and servers. Leading organizations including Hearst, SoFi, and Peloton rely on strongDM to eliminate the manual-heavy work required to onboard, offboard, and audit staff’s access to everything. Simplify your access control strategy today and schedule a demo to see how much easier your life can be.
Segment provides the reliable data infrastructure companies need to easily collect, clean, and control their customer data. Once you try it, you’ll understand why Segment is one of the hottest companies coming out of Silicon Valley. Segment recently launched a Startup Program so that early-stage startups can get a Segment account totally free up to $25k, plus exclusive deals from some favorite vendors and other resources to become data experts. Go to dataengineeringpodcast.com/segmentio today and see if you or a startup you know qualifies for the program.
Many data engineers say the most frustrating part of their job is spending too much time maintaining and monitoring their data pipeline. Snowplow works with data-informed businesses to set up a real-time event data pipeline, taking care of installation, upgrades, autoscaling, and ongoing maintenance so you can focus on the data.
Snowplow runs in your own cloud account, giving you complete control and flexibility over how your data is collected and processed. Best of all, Snowplow is built on top of open source technology, which means you have visibility into every stage of your pipeline, with zero vendor lock-in.
At Snowplow, we know how important it is for data engineers to deliver high-quality data across the organization. That’s why the Snowplow pipeline is designed to deliver complete, rich and accurate data into your data warehouse of choice. Your data analysts define the data structure that works best for your teams, and we enforce it end-to-end so your data is ready to use.
Get in touch with our team to find out how Snowplow can accelerate your analytics. Go to dataengineeringpodcast.com/snowplow. Set up a demo and mention you’re a listener for a special offer!