Summary
Data analysis is a valuable exercise that is often out of reach of non-technical users because of the complexity of data systems. To lower the barrier to entry, Ryan Buick created the Canvas application with a spreadsheet-oriented workflow that is understandable to a wide audience. In this episode Ryan explains how he and his team have designed their platform to bring everyone onto a level playing field, and the benefits that it provides to the organization.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Unstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real-world data, from videos and images, to 3d point clouds and geospatial records, to industry specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, consistency validation, as well as enrichment through sources including machine learning models, 3rd party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business.
- Your host is Tobias Macey and today I’m interviewing Ryan Buick about Canvas, a spreadsheet interface for your data that lets everyone on your team explore data without having to learn SQL
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Canvas is and the story behind it?
- The "modern data stack" has enabled organizations to analyze unparalleled volumes of data. What are the shortcomings in the operating model that keeps business users dependent on engineers to answer their questions?
- Why is the spreadsheet such a popular and persistent metaphor for working with data?
- What are the biggest issues that existing spreadsheet software runs up against as it scales both technically and organizationally?
- What are the new metaphors/design elements that you needed to develop to extend the existing capabilities and use cases of spreadsheets while keeping them familiar?
- Can you describe how the Canvas platform is implemented?
- How have the design and goals of the product changed/evolved since you started working on it?
- What is the workflow for a business user that is using Canvas to iterate on a series of questions?
- What are the collaborative features that you have built into Canvas and who are they for? (e.g. other business users, data engineers <-> business users, etc.)
- What are the situations where the spreadsheet abstraction starts to break down?
- What are the extension points/escape hatches that you have built into the product for when that happens?
- What are the most interesting, innovative, or unexpected ways that you have seen Canvas used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Canvas?
- When is Canvas the wrong choice?
- What do you have planned for the future of Canvas?
Contact Info
- @ryanjbuick on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a-t-l-a-n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Ryan Buick about Canvas, a spreadsheet interface for your data that lets everyone on your team explore data without having to learn SQL. So, Ryan, can you start by introducing yourself?
[00:01:39] Unknown:
Yeah. Hey. Thanks for having me on. So I'm Ryan, one of the founders of Canvas. We've been around for about a year now. Our mission is, like you said, to help really bring the modern data stack to operators and to business teams so that they can, you know, make better, faster decisions, and data teams can, you know, continue to focus on the work that really matters to them. And do you remember how you first got started working in the area of data? A bit of a weird story. So it started a couple years ago. I was one of the first product managers at a company called Flexport. So if your audience isn't familiar with Flexport, it's basically a tech enabled freight forwarder. So trying to disrupt the millennia-old industry of shipping goods from point A to point B and all of the complexities around that.
Pretty, pretty topical nowadays. So, yeah, I joined as one of the first PMs there and really didn't have all that much experience with data as a product manager. And so, you know, when I joined, I realized, you know, everything was so data driven. You know, every decision had to be, you know, presented with, you know, objective evidence of, okay, here's where we're at, here's, you know, how we think we're gonna improve metric X. And, you know, before, I had data pretty much served up to me as a product manager, as an operator. And so, honestly, I had a lot of anxiety over it. And so I actually went and took a data analytics boot camp. So I spent my first couple months at Flexport doing nights and weekends learning, really, how to write, you know, advanced SQL, how to, you know, think about cleaning data, how to think about, you know, working with data in Excel as well.
And that was super helpful for me in terms of getting exposed to data. And, yeah, that's sort of how I ended up in this space.
[00:03:21] Unknown:
And as far as the Canvas project, I'm wondering if you can describe a bit about what it is and some of the story behind how you decided that that was where you wanted to spend your time and energy and build a product versus just continuing on your career path of being a product manager.
[00:03:37] Unknown:
That story also started at Flexport. So I met my cofounder, who was an engineer; we were on the same team. Really, we had seen just how difficult it was for, you know, Flexport as an acute version of this thing, where there's so many operations employees, right, and the data team could frankly just never keep up with the amount of requests. And I think we did a good job of, you know, having strategic dashboards, you know, for the business, but there were just so many ad hoc, everyday questions that needed to be answered. And there wasn't really a great interface for those questions to be answered. Right? We saw a data team that, you know, was spending millions of dollars on, you know, the modern data stack and, you know, analysts that could help out each individual team.
But, ultimately, you know, you saw really long lines for data requests. Right? You saw business teams getting frustrated and just exporting data into spreadsheets, and you start to see these Google Sheet reporting empires stand up from team to team. You know? That really creates these silos within the organization and broken feedback loops. No one really knows what's happening, you know, in these CSVs, and, you know, the business teams are frustrated because they have to export these CSVs, you know, daily or weekly just to keep up with real time data.
And so we worked on an internal data product there that was really, you know, almost a spreadsheet on top of a few different tools for a pretty specific use case, but we decided, hey, you know, why not try to make this horizontal and really give, you know, nontechnical business teams a way to actually work with data in a somewhat independent and confident way. I started really thinking about it during COVID and decided to jump into it in late 2020. And, you know, I'll probably talk a little bit about what Canvas is. So, really, Canvas is a data exploration tool, a data visualization tool primarily for business teams. So we integrate with, you know, the modern data stack. We integrate with Snowflake and BigQuery and Redshift and all the warehouses. We also integrate with dbt, which is, you know, exploding in popularity, obviously.
And I think the interesting thing there is that it really gives business teams, for the first time, a sort of reasonable starting place for working with data. Right? You have, you know, big wide tables that these data teams are producing instead of, you know, trying to answer every single ad hoc request that comes through, focusing more on creating scalable data models that these teams can use. And so we found the spreadsheet interface is a really nice way to say, hey, data teams, you can actually, you know, share these models with your teams, give them an interface to answer their own questions instead of coming to you directly with the question, and be able to explore that so that they can, you know, really get 80% of the way there to answering their question. If they get stuck, they can actually collaborate in our tool. And so data teams can actually inspect the spreadsheet work that's being done via SQL, make changes there, and ultimately get people to an outcome faster.
[00:06:24] Unknown:
You mentioned that Canvas is intended as a way to bring the promise of the modern data stack to people who don't want to put in all of the engineering effort to manage that themselves. I'm wondering if you can just talk through the shortcomings that you see in the operating model for what has come to be known as the modern data stack, and how that keeps business users dependent on engineers to be able to answer the questions that they wanna iterate on? Yeah. For sure. So I think there's
[00:06:53] Unknown:
you know, for all of the advancements, I think, that the modern data stack has brought. Right? We have best of breed tools for each part of the stack. You know, there's a lot of focus there. There's a lot of high quality, you know, bringing developer practices of 10, 20 years ago to data teams. I think it's been amazing, but there's still a pretty high barrier to entry. Right? The modern data stack is pretty expensive. It requires, you know, having people that are capable of implementing it, building a reasonable, you know, set of data models, and collaborating with the business team on scoping that all out. I think the other piece here is, you know, iteration is pretty slow. Right? New fields get added. You know, I talk to line of business owners that are frustrated because they don't know what's happening as soon as they wanna make a couple changes to their Salesforce objects. Right? They don't really understand the delay. So I think there's still a lot of work that we need to do in terms of, you know, bridging the gap between business and data teams and making it clear, hey, you know, the work that you want to do, we are trying to enable as a data team. But that needs to be more of a collaborative process, not, you know, business owners blindly saying, hey, we need to make these changes, and then data teams are thought of as an afterthought. Right? Like, that never happens with engineering. They're brought in and they're given a seat at the table, you know, upfront.
And so I think there's a lot of work to do culturally there as well with the modern data stack. And then lastly, I think a lot of the interfaces haven't really caught up with the way that, you know, data is now being thought about. Right? I think there's been a lot of talk about the death of the dashboard, and I think we pretty much agree on that. Right? A lot of the strategic dashboards have been solved, but a lot of the work that's being done now is more complex questions by operators and by business teams. It's wanting to compare and contrast and diagnose a couple different tables together in one place.
And so I think that's really the heart of what we're trying to do at Canvas, which is give a flexible interface on top of this trusted data and on top of the certified data that data teams are working on, and give them a way to just ask whatever question they have at the moment rather than getting in the breadline and waiting to ask it. I think those are the shortcomings that we're seeing with the modern data stack, but I think there's a ton of exciting things that are happening to sort of bridge that gap.
[00:08:57] Unknown:
In terms of what you're building at Canvas, you mentioned that the primary interface paradigm is this spreadsheet that everybody has become familiar with over the years. And I'm wondering what you see as the reason that the spreadsheet is such a popular and persistent metaphor for being able to work with data and for making it accessible to people who don't think of themselves as technical?
[00:09:19] Unknown:
Yeah. Yeah. It's a great question. I think it's a few things. Right? It's just it's been around forever. It's the first thing that you're taught, you know, in business school, right, is, you know, how to learn Excel. And so I think the sort of institutional powers that be are sort of, you know, propagating and holding up Excel as a way to do this, but I also think it's a great way to just visually program. It's the first programming language that, you know, most people learn, and I think it is very similar to SQL in a lot of ways, and we see them as really two sides of the same coin. Of course, there's some things that break down there. I'm sure we'll probably cover those. But, yeah, I think it's an easy way to just match the mental model of however you're thinking about a problem, being able to mold that and work with that to get to the answer that you want. And it's also just a lot easier to iterate in some cases. Right? Like, there are cases that we see of data teams in our tool where they'd rather use a spreadsheet paradigm than use SQL to answer a quick, dirty question. You know, it's familiar. It's fast. It's relatively iterative, so I think that's why it's sticking, and why it's continuing to stick. And I think that's something that we're really trying to take advantage of rather than, you know, just looking down upon it as, you know, this legacy interface, but really trying to make it better, honestly, in the context of how data is really so prevalent today in most startups.
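[Editor's note] Ryan's "two sides of the same coin" point can be made concrete with a small sketch. A common spreadsheet formula like SUMIF has a direct SQL counterpart, and the generalized version of the same move (a pivot) maps to GROUP BY. The `orders` table and its columns below are invented for illustration; they are not from Canvas:

```python
import sqlite3

# Hypothetical "orders" data, the kind a business user might otherwise
# paste into a spreadsheet. Table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("East", 100.0), ("West", 250.0), ("East", 50.0)],
)

# Spreadsheet mental model: =SUMIF(A:A, "East", B:B)
# SQL counterpart: a filtered aggregate over the same columns.
east_total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE region = 'East'"
).fetchone()[0]

# The generalized spreadsheet move (a pivot by region) maps to GROUP BY.
by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
print(east_total)  # 150.0
print(by_region)   # {'East': 150.0, 'West': 250.0}
```

The same mental model (filter a column, aggregate another) expresses cleanly in either notation, which is what makes a bidirectional translation between the two plausible.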
[00:10:35] Unknown:
And in terms of the existing systems that we have for being able to work with data in these spreadsheet interfaces, with Excel obviously being the most prominent one, but also things like Google Sheets or any number of other platforms and projects that provide this spreadsheet interface. I'm wondering, what are the biggest issues that users encounter as they try to scale the usage of those spreadsheets, both from a technical, you know, data volume and data complexity perspective, but also from an organizational capacity?
[00:11:07] Unknown:
The first obvious one, right, is just, you know, the analytical performance. Right? You can't be taking event data and putting it into Google Sheets; it's gonna fall over. So I think that's one of the first use cases that we see and that we can really help out with, with our tool. I think, you know, there's another thing there, which is, you know, like you mentioned, organizational. Right? How many times have we seen spreadsheets with do-not-edit tabs, you know, in red? And people sort of hold on to these, you know; they're afraid to let go of their Legos, either for, you know, privacy concerns or, you know, not wanting someone to break the massive model that they spent weeks building. So there is a sense of, hey, because this is editable and because this is so easy to make changes to, I actually want to do the reverse and keep it close to the vest, rather than having something that is version controlled, you know, built over a trusted dataset in real time, that can be, I think, more easily shared. And if you have some permissions that make sense, it's something that can actually spread within an organization and help people actually use that more easily.
So I think those are the two biggest things that we see, and I think that's a real reason why people are trying to move away from, you know, having, you know, PII data being passed around your org. You know, trying to move away from having, you know, your entire sales ops team run off of one Google Sheet. It's something that's, you know, not great for either the business or the data teams, and I think they're both looking for a better way to do their
[00:12:34] Unknown:
jobs. As you explored the capabilities of this metaphor in your own tool, what are some of the extensions or modifications to that paradigm of the spreadsheet, and the ways that people typically interact with them, that you needed to introduce to be able to account for some of these technical and organizational limitations in the default paradigm that people have used it for?
[00:12:59] Unknown:
Yeah. Definitely. And this has been an interesting adventure with dbt in particular. Right? You know, a big part of spreadsheets, and the way that, you know, information is shared, is you often wanna see and show the work behind your analysis. Right? And so I think leaning on a lot of what's being done with the DAG and being able to show, hey, this spreadsheet is actually composed of, visually, these other items, right, of these other datasets. You know, these fact and dim tables were used in the creation of this. This pivot was made on top of this to create this chart. That's really something we're trying to invest in, and show people, hey, just because this is a spreadsheet doesn't mean that it's not a powerful analytical tool. It doesn't mean that it's not extensible. It doesn't mean that you can't actually double click and see the lineage behind it to understand what breaking changes might happen. Right? And so I think that's the first part, which has been super interesting; it's sort of the lineage, I'll summarize it as. Another one that's really interesting is thinking of it in terms of more of components rather than these one-off things, and I think this lends itself to the lineage stuff. But, you know, all of a sudden you now have business users that are capable of creating these repeatable insights and these repeatable, almost, models, if you will.
So how do you create a nice graduation path for something that is, you know, forked from the model that the data team has created? The business team has made some edits to it, and now you actually wanna make this something that's reusable and certify it so that it's available to the rest of the org. This is something that we're pretty excited about. And, you know, we're calling them components, sort of in the style of Figma, right, and the design libraries that we see out there. And so creating a component library of something that actually can start to accrue value over time, instead of just being another request response workflow between business and data teams, is something that we're pretty excited about. Obviously, there's, you know, some guardrails that you have to create around formulas; you know, they can't just be arbitrary.
You have to put in some guardrails so people don't, you know, completely choke the pipelines. So that's been an interesting one to see, you know, which ones are relatively noncontroversial for business teams to go and work with, and also trying to, you know, account for performance across the board and making sure that, you know, just because something is available doesn't mean it's not gonna, you know, result in a 45 second query where they'd bail. You mentioned working alongside dbt
[00:15:16] Unknown:
and building on top of the modern data stack. I'm wondering if you can give your perspective on the role that Canvas plays in the overall data ecosystem for an organization? Like, what are the utilities that it might replace? What are the utilities that it's going to augment? You know, who are the primary personas that are going to be interacting with it, and how are they going to collaborate with the other kind of data roles and stakeholders within the organization?
[00:15:46] Unknown:
Yeah. Totally. I think the first thing that we see, really the primary pain point for data teams, is, hey, we cannot actually continue to scale when, for every 10 employees that we hire on the business side, we have to hire another analyst. Right? And so they're looking for a way to break that correlation. And so that's really where we come in and say, okay, well, let's try to set a goal. Let's try to reduce your data requests by 25%. And the way that we try to do that is saying, okay, you've already invested in your dbt project. Right? You've already created these, you know, wide tables that are a reasonable starting point for business teams to work with, but you don't really have an interface in which that's just available for exploration. Right? You might have an existing strategic sort of BI tool or set of dashboards.
You might be pumping, you know, this data into different systems with, you know, reverse ETL or customer data activation, I guess. You've had these systems for a while, but they're still not resulting in any reduction in your data requests. Right? And so what's broken there? What can we fix? And that's really our approach, is saying, hey, we can complement these systems. Instead of you getting a request for something, why don't you actually respond back with a Canvas link and say, hey, here's the model that you're looking for. I know you're gonna have 6 questions after I even give you back the original query that you wanted. You can actually ask those 6 additional questions here with just your spreadsheet skills, by just pivoting, by creating some charts, by creating some formulas, by joining with some other tables, and we give them a relatively easy way to do that without even really knowing exactly what a SQL join might be. And so, you know, that's really the primary sort of persona: working with data teams that are looking for a way to scale. And, of course, we have business teams that come to us too, and they want to just be able to have an interface that they can control over their warehouse or over their database, even if they're early. And that's something in which we're really just trying to help them be able to move faster and do more with less.
And, yeah, those are sort of the core personas that we're working with today, primarily, you know, in B2B SaaS and, you know, e-commerce sort of startup roles. But I think the interesting trend that we're really trying to ride most is that a lot of these operational roles 5, 10 years ago were primarily just doing day to day work and, you know, executing on tasks. Most of that work has been, you know, or not most, but a lot of that work has been automated. Right? And so they're moving more and more into strategic roles, no matter if you have, you know, ops in your title. And so we have this pretty big skills gap where, you know, 90, 95% of the organization doesn't know SQL, but now they're expected to be in a data driven role.
And so what are the tools? They don't really have a home today in terms of being able to do everyday tasks with the data that they need. And so that's really the trend that we're seeing and really wanna give these operators a home in the modern data stack that makes sense.
[00:18:36] Unknown:
And there have been a few other efforts to make exploration of data easier to do. You mentioned, obviously, business intelligence systems, and there have been a few iterations of those recently that have tried to reimagine some of the ways that you interact with it. One of the ones that comes to mind is something like Lightdash, where it builds on top of the dbt models and relies on some of the dimensional modeling to be able to say, you know, here is your domain object, here are the different dimensions along which you can aggregate and slice it, here are sort of the guardrails for that. I'm wondering what you see as kind of the juxtaposition of what you're doing at Canvas with some of the other ways that people have tried to approach the same problem of providing self-service data exploration in a way that is approachable to the business users without having to, you know, throw them in the deep end of go ahead and learn SQL. Good luck. I'll see you in 6 months.
[00:19:31] Unknown:
Yeah. Totally. I think there's obviously been a ton of tools that help with, you know, helping them learn SQL, or making SQL more collaborative, or, like you mentioned, being able to give them sort of more exploration, you know, capabilities. I think the thing that we're really trying to lean in on is that there are very analytically minded people throughout the org that need to do things beyond just a couple basic filters and slicing and dicing of data. They actually have models that are sitting in Google Sheets that are completely manual, updated with live data, you know, every Monday or every month. Right? And so I think that's the work that we're really trying to go after and say, hey, there's a better way to do this, rather than in Google Sheets that, you know, are gonna start to fall apart after a few weeks' worth of massive datasets, and give you the ability to actually replicate those processes over real time data, over data that's, you know, going to be trusted, and really try to bring those data and business teams together. So I think that's really the gap that we're trying to fill: giving citizen data scientists, I guess, is the way to call them, but, really, these analytically minded operators that need Excel like, you know, strength, Excel like power, but they need it with real time data. That's really where we see ourselves coming in.
[00:20:54] Unknown:
Unstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real world data, from videos and images to 3D point clouds and geospatial records to industry specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, consistency validation, as well as enrichment through sources including machine learning models, third party data, and web APIs.
Go to dataengineeringpodcast.com/unstruk today. That's u-n-s-t-r-u-k. And transform your messy collection of unstructured data files into actionable assets that power your business. And in terms of the actual Canvas platform itself, I'm wondering if you can talk through some of the ways that it is implemented and architected to be able to provide these data manipulation and data exploration capabilities while still being interactive and performant enough for people to be able to iterate on the questions that they have?
[00:22:14] Unknown:
Yeah. So one of the goals that we had first setting out was, obviously, it has to be fast. Right? If you're gonna compete with these, you know, sort of incumbents, it needs to feel like a native app, but it has to be in the browser. Right? So our stack is pretty exciting. It's pretty new. It's Rust and WebGL and WebAssembly. Looking at a lot of the sort of, like, modern collaboration tools out there is really something that we try to model ourselves after. And, of course, you know, integrating with the best of breed, you know, parts of the modern data stack. Right? So integrating with the warehouses, integrating with ETL tools like Fivetran and Airbyte and a couple others, and really trying to make sure that this is something that data teams are, again, not gonna see as some toy for business teams, but rather as a fully fledged real data tool on top of the modern data stack, and something that meets, you know, the high standards that they should have.
[00:23:03] Unknown:
And as you have explored this problem space and dug further into actually implementing your own spreadsheet interface and figuring out how to provide the right paradigms and metaphors and build in the necessary escape hatches for when you start to exceed the complexity that spreadsheets are able to support, what are some of the ways that the design and goals of the product have changed and evolved since you started working on it? 1 of the first things was really
[00:23:31] Unknown:
realizing that, you know, you need to be able to let folks get far enough in their spreadsheet sort of exploration or spreadsheet modeling before they run into a point where they can't go any further. Right? And so you have a couple different options there. The first thing, and probably the thing that gets people most excited when I demo, is you can actually open up any of the spreadsheet analysis, and we'll actually generate really nice CTEs and really nice SQL for data teams to go in and inspect and edit. And that change is bidirectional. So any changes that are made in the SQL will then actually find, you know, an almost 1 to 1 sort of equivalent in the spreadsheet interface.
And so that's sort of the first of these, this almost escape hatch, I'll call it, right, where, hey, I've gotten far enough. Maybe I've made a mistake, or maybe I just need help with something that's a bit over my skis. I can tag the data person. They can come in, and instead of trying to reverse engineer, you know, formulas and references in some sheet that you've never seen before, you can actually just open up the hood and see the SQL and make changes really quickly. And so that's pretty much our primary sort of escape hatch right now. Yeah. I guess I'll stop there.
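As a rough illustration of that escape hatch, a chain of spreadsheet-style steps could be compiled into named CTEs so each step stays individually inspectable and editable by a data person. This is a hypothetical sketch, not Canvas's actual implementation; the tiny "op" vocabulary and function name are invented for illustration.

```python
def compile_to_sql(base_table, steps):
    """Render each spreadsheet-style step as a named CTE over the previous one."""
    ctes = []
    prev = base_table
    for i, step in enumerate(steps):
        name = f"step_{i + 1}"
        if step["op"] == "filter":
            body = f"SELECT * FROM {prev} WHERE {step['predicate']}"
        elif step["op"] == "pivot":  # simplified pivot: GROUP BY plus one aggregate
            body = (f"SELECT {step['group_by']}, {step['agg']} AS value "
                    f"FROM {prev} GROUP BY {step['group_by']}")
        else:
            raise ValueError(f"unknown op: {step['op']}")
        ctes.append(f"{name} AS (\n  {body}\n)")
        prev = name  # the next step reads from this CTE
    return "WITH " + ",\n".join(ctes) + f"\nSELECT * FROM {prev}"

# Example: a filter followed by a pivot over a hypothetical "orders" table.
sql = compile_to_sql("orders", [
    {"op": "filter", "predicate": "status = 'paid'"},
    {"op": "pivot", "group_by": "region", "agg": "SUM(amount)"},
])
print(sql)
```

Because each CTE maps back to exactly one spreadsheet step, edits to a single CTE can, in principle, be reflected back into the corresponding step, which is what makes the bidirectional round-trip tractable.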
[00:24:40] Unknown:
And as far as being able to manage that translation, what are some of the spots where you've seen the impedance mismatch and some of the ways that it breaks down where somebody has a specific formula that they're trying to build based on their Excel knowledge, and it doesn't quite map cleanly to the SQL or vice versa where you have the SQL query and you're trying to figure out how do I actually map this into an Excel function that is understandable and isn't going to be, you know, 3,000 characters long?
[00:25:10] Unknown:
I think the first thing so 1 of our sort of goals when we sit down and prioritize, you know, functionality for the spreadsheet or functionality for the SQL is to make sure that there is a close 1 to 1 representation between them. 1 of the big decisions that we had to make was, do we wanna actually write back to the warehouse, and do we wanna actually write back to the data sources? And that's something that raises a lot of concerns from, you know, the CSOs that you talk to. So that's been 1 example, I guess, of something that we wanted to stay away from, along with things like, you know, maybe some data validation that doesn't make sense, obviously, in SQL.
But, yeah, I think there's a decent amount of these that can come up if you're not careful about actually prioritizing the right things, which I I think we've done a fairly good job of of doing to date.
[00:25:56] Unknown:
So in terms of the actual workflow for a business user who wants to understand what's happening in their product, what is the actual workflow for being able to figure out what are the datasets that I have? How do I pull that into a sheet, figure out what the appropriate subset is, build my analysis, share it with people, collaborate? That overall end to end journey of figuring out how to answer your own questions.
[00:26:28] Unknown:
The first part is the data library. Right? And so data teams have full control over, hey, here's the, you know, the different levels of models that we've prepared. Right? Here's the ones that are going to probably be the most popular, and so we use social proof there too to show, hey, you know, these are gonna be the things that are popular amongst finance teams or marketing teams. Right? So it's part discovery, right, and showing them, hey, here's probably what you're looking for and allowing them to search for it easily. I think the second part, which has been huge, is, you know, leveraging descriptions in dbt and being able to surface those in the front end. Right? A lot of frustration, even, you know, that I had back in my Flexport days, is you often don't understand what the column is that you're looking at, right, or how the column was calculated.
And so we can actually take a lot of that metadata and actually surface it to the business teams so that they can more easily understand what they're looking at or which column is gonna be the right 1 for them to work with. And then I think a lot of it is just exploration, honestly. Once they drag it onto their canvas, it basically looks like Figma, if Figma was designed for data. You can see, you know, a lot of the sort of I can have 1 table. I can see its relation to another table in terms of whether it was joined or not. I can write some formulas over it. I can do some pivots, create some charts, show my lineage, or hide that away.
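The dbt-description surfacing he mentions could, in principle, read column metadata straight from a dbt-style `manifest.json`. This is a speculative sketch against dbt's published manifest layout, not a description of Canvas's actual code; the model name and descriptions below are made up.

```python
def column_docs(manifest, model_name):
    """Map column name -> description for one model in a dbt-style manifest dict."""
    for node in manifest["nodes"].values():
        if node.get("name") == model_name and node.get("resource_type") == "model":
            return {col: meta.get("description", "")
                    for col, meta in node.get("columns", {}).items()}
    raise KeyError(f"model not found: {model_name}")

# Minimal stand-in for what dbt writes to target/manifest.json.
manifest = {
    "nodes": {
        "model.shop.orders": {
            "name": "orders",
            "resource_type": "model",
            "columns": {
                "amount": {"name": "amount",
                           "description": "Order total in USD, net of refunds"},
            },
        }
    }
}

docs = column_docs(manifest, "orders")
```

A front end could then render `docs["amount"]` as a tooltip next to the column header, so a business user sees how the column was calculated without asking the data team.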
A lot of it is sort of this just, you know, iterative prototyping process, and this is something I think that's been particularly surprising for me: watching data teams actually want to use it for prototyping, because a lot of that is, you know, honestly done today by making SQL exports and then pulling them into sheets and looking for anomalies. So that was definitely something that surprised me a little bit, seeing how data folks actually wanna use this tool as well. And so that's really sort of the journey: you know, pulling in a couple different tables, creating some pivots. Honestly, it's not too complicated. Looking for anomalies from there and then continuing to refine it until you end up with something that, you know, can look like a dashboard with a series of charts, or it can look like a sheet report that you've had in Google Sheets with a bunch of formulas. It really can go in any direction from there. And then, of course, the most valuable part is actually sharing that. And so we have, you know, permissions such that you can take something and promote it to a shared canvas that other people can see, and that's managed by sort of a higher role to kinda make sure that governance is there from a human led perspective. Right? Not everyone on the team can have that capability.
And so, you know, ultimately, it's about sharing that and making it easier and then making it live as part of your data library as a component, you know, in perpetuity. So this, you know, isn't just a series of 1 off explorations that are happening, but rather this, like, growing knowledge base, right, over time that isn't just contributed to by the data team. It's actually now you're getting input from more folks. You're starting to see who's using models more. You're starting to see who's not using models and able to prune those as you go. And, really, the result is hopefully a better knowledge base for the entire workforce.
[00:29:21] Unknown:
And in terms of the iterative process and managing the versions that have been the plague of everybody who's ever dealt with Excel, I'm wondering how you've approached that problem, and in particular, being able to manage things like snapshots in time where you say, I want to build this report to be able to capture, you know, sales figures for the month of June, for instance. And I don't want the sheet to automatically update when I get my new sales figures because I want to maintain this view of the data as it exists at the time that I make this analysis so that I can refer back to it as I do successive iterations of it, for instance, and just managing some of those different versioning requirements depending on the sort of varying use cases for how the data is being accessed and shared?
[00:30:07] Unknown:
Great question. So 1 of the nice architectural decisions that we've made is, you know, because it's a multiplayer application, right, you can actually see the multiple cursors moving around in real time, working on the same data as if you were in Figma. You know, sort of every action is taken as part of a ledger. Right? And so we can actually version control each and every sort of stage of the analysis and allow you to revert back to 1 of those stages and eventually let you freeze to a particular stage so that you can just lock in those numbers, like you mentioned, for a particular, you know, bookmark in time or snapshot in time. So I think that's 1 of the other, you know, really interesting things about this architecture is that you can actually now have a sort of a system that's living in multiple different states at once without the headaches or the anxiety of having a sheet that's exposed to a bunch of people that can potentially change it out from under you.
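The append-only action ledger he describes can be sketched in miniature: every edit is an immutable event, any prefix of the ledger can be replayed into a state, and "freezing" is just remembering a position in the ledger. This is a toy illustration of the pattern, with an invented event shape, not Canvas's real data model.

```python
class Ledger:
    """Toy append-only action log with replay and frozen snapshots."""

    def __init__(self):
        self.events = []      # immutable, append-only history of (cell, value) edits
        self.snapshots = {}   # snapshot name -> ledger position (a frozen view)

    def apply(self, cell, value):
        self.events.append((cell, value))

    def state(self, upto=None):
        """Replay events (optionally only the first `upto`) into a cell -> value dict."""
        cells = {}
        for cell, value in self.events[:upto]:
            cells[cell] = value
        return cells

    def freeze(self, name):
        """Pin the current ledger position under a name, e.g. a month-end close."""
        self.snapshots[name] = len(self.events)

    def frozen(self, name):
        return self.state(self.snapshots[name])


ledger = Ledger()
ledger.apply("A1", 100)
ledger.freeze("june-close")   # lock in the June numbers
ledger.apply("A1", 250)       # new data keeps flowing in afterwards
```

Here `ledger.frozen("june-close")` still reports the June value even though the live `ledger.state()` has moved on, which is the "multiple states at once" property without copies of the sheet drifting apart.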
[00:31:03] Unknown:
How does the analysis being done in Canvas get fed back into things like dbt or the other analytical systems that an organization might be using?
[00:31:12] Unknown:
This is something that's really interesting, I think, in terms of the future of how data and business teams work together, where, you know, if we're going to start shifting the onus of dashboard creation to business teams, right, and have data teams really focus on data quality and data usability, I think this is gonna be a key, you know, part of the stack moving forward, which is how do you let someone go and take this insight, go and take this model, and allow it to graduate back into your systems? And so for right now, this is a fairly simple, you know, interaction between the data and business teams. The data teams are able to see what the business teams are doing. They can collaborate and have a conversation around that there. They can also see which models are being used most often. Right? And so I think the really exciting thing is that we're, at least for now, creating opportunity for this conversation to happen, whereas before, it was, you know, completely out of purview. Right? And you have, you know, heads of data that are like, hey, if it's in a CSV, it's not our problem. Right? But it's like, okay, what if you end up with enough teams with their data in CSVs? That eventually is gonna be your problem because this last mile is broken.
And so I think that's 1 of the things that we're most excited about is, hey, let's actually create a mechanism for business teams to be able to own part of this input and, you know, eventually work on productizing and automating this sort of graduation path back into your dbt library. But I think 1 of the coolest moments that we're having is watching this conversation happen between data and business teams that before was never really happening.
[00:32:44] Unknown:
In terms of that collaboration, you mentioned that Canvas is an application that allows for multiple people to be able to work on the same sheet and edit it. I'm sure you have some capabilities for being able to lock it from editing while somebody else is using it and, obviously, some of the data governance and oversight elements that you mentioned.
I'm just wondering what are some of the other ways that you've seen data teams and business users collaborate for being able to manage some of these component libraries that you mentioned, or being able to handle the setup of the initial datasets, and maybe some of the feedback from the business teams into the data teams to say, you know, these are the kind of shapes of data or the domain objects that we want to be able to work with so that we can build our own analysis, and just some of the conversations that that has opened up as far as how to even think about data modeling at the earliest stages of the process.
[00:34:32] Unknown:
I think 1 of the more interesting things has just been when you have a set of biz like, let's call it their biz ops team. Right? When the biz ops team comes in and sees the difference between what's available in the dbt project or what's in the warehouse versus what they have in their own systems, I think that's often a conversation that, like, isn't really happening outside of this. And so you're actually watching the data teams and the business teams say, okay, well, we need to answer these 10 questions, but let's see if they can actually answer 8 of those questions, you know, in the tool itself, and then see the leftovers, and, okay, now this is what we're gonna go scope and put into our dbt project.
And so I think we're seeing this, like, play out live rather than this just being some sort of scoping exercise between these different teams, where you don't even know if they're gonna end up using those models or not. So I think that's often the first thing that we see is, like, hey, we're gonna need, you know, more variations on the shape of this data, or we're gonna need a few more preset joins to be available here. And I think that's really exciting. I think beyond that, it's a matter of really trying to automate a lot of the consumption metrics that typically, like, you're forced to build manually in some of these tools. And so what we're really excited about is just being able to show on a per team basis, on a per model basis, how they're being consumed so that you can actually have some objective fact that you can bring back to your data team and say, hey, this isn't working, these are working, we should invest more in this area, and also look at how they're using these actual models. Right? And you can point to the evidence. We'll let you drill in from the data library to the actual canvas where it's being used so you can actually really grok what they're doing with that data. And so, yeah, I think those are 2 things that come to mind there.

In terms of the applications of Canvas that you've seen as you have opened it up to your customers and as you have entered general availability, what are some of the most interesting or innovative or unexpected uses that you've seen?
I mean, 1 of the more interesting things was just honestly seeing people wanting to connect their data without talking to us. We're doing a product-led growth, you know, sort of strategy, which, for those who don't know, is basically putting up a free trial or freemium tier, which for data is, I think, relatively new. So that's been definitely surprising, just watching how many teams will just connect their warehouse and wanna kick the tires on it without having to talk to anyone. I think once you get into the tool and once you start to get into the use cases, it's super interesting seeing just how many different verticals and use cases are out there, but at the end of the day, how similar a lot of the data structure that they need is and how similar a lot of the common data sources that they need are. And so that's been something for us where we're really excited. We're creating essentially templates that will basically be, you know, out of the box packages where you can just say, hey, if you're looking at Stripe data and you're on this schema or on this stack, you'll be able to actually consume that model automatically and so hopefully save your data team some time for some of those, you know, shallower, simpler models. Right? Stuff that doesn't have, you know, custom fields in it. So I think that's been an interesting insight: there's so much variety and variability out there, but at the end of the day, a lot of the actual questions that are being asked are the same, and there's a huge opportunity to automate some of that.
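The template idea he describes amounts to matching a connected source against a registry of prebuilt models. A minimal sketch of that matching, with an entirely invented registry and table names (nothing here reflects Canvas's actual templates):

```python
# Hypothetical registry: each template names the source it applies to and
# the tables it expects to find in the customer's schema.
TEMPLATES = {
    "stripe_mrr": {"source": "stripe", "requires": {"charges", "subscriptions"}},
    "ad_spend":   {"source": "google_ads", "requires": {"campaigns"}},
}

def matching_templates(source, tables):
    """Return template names whose required tables are all present for this source."""
    tables = set(tables)
    return sorted(name for name, tpl in TEMPLATES.items()
                  if tpl["source"] == source and tpl["requires"] <= tables)

# A Stripe connection exposing three tables qualifies for the MRR template;
# one missing required table disqualifies it.
hits = matching_templates("stripe", ["charges", "subscriptions", "refunds"])
```

In a real system the "custom fields" caveat from the interview would show up here as a check that the detected columns still match the template's expectations before offering it automatically.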
[00:37:37] Unknown:
As far as the experience that you've had of starting this company, building up the product, working with your customers, what are some of the most interesting, unexpected, or challenging lessons that you've learned as you've explored further into the space? I think as a founder, it's definitely just, you know, the
[00:37:53] Unknown:
first the amount of time that it takes to just build a product that people wanna use, right, especially in the data space. It's not a small task to really make something that's different but also reliable. There's just a lot of work that goes into that, and it takes a lot of, you know, patience and working as hard as you can, which are 2 things that are pretty hard for humans. So I think that's been, you know, 1 of the parts of the journey. I think also just realizing how much, you know, it takes to really get embedded with your customer. Right? Get on a texting basis with them. Make sure that they understand that you're not just a product, but that you're actually a service to them. And a lot of the reason why folks buy from early stage start ups is going to be, hey, you're our external data team as a service. You're helping us think about these things. You are working with other companies. You're seeing what's out there. You know, going in, I thought a lot of it would just be, hey, we build software, and we give it to you. But it's often much more than that, and they trust you and want to learn from you in terms of, hey, what are some best practices? Are we doing the right thing here? What are some things that you're seeing from the business that maybe don't make sense? Right? And so you're a consultant in a way, which I think was really surprising and pretty cool, honestly, because you get a chance to help someone out with not just, you know, saving time, but also thinking about their role and thinking about their business.
[00:39:08] Unknown:
And for people who are looking for a way to make their data self serve and easier to access for the business users, what are the cases where Canvas is the wrong choice?
[00:39:19] Unknown:
You know, if you're looking for hardcore data analysis, if you're looking for Python and R, you know, tools that are gonna be very heavy for the data team, that's not gonna be something where we're a great fit. I think we complement a lot of tools in that case, and we do have a couple customers that have, you know, multiple front ends, right, for different personas. And I honestly think that's how the arc of the modern data stack is bending. Right? It's bending towards best of breed tools for whatever use case. And so I would say that's probably, you know, going to be a clear sort of decision tree: if you need it for, you know, data science use cases, we're not gonna be a great fit.
[00:39:51] Unknown:
As you continue to build and iterate on the product, what are some of the things you have planned for the near to medium term, or any capabilities or use cases that you're excited to dig into?
[00:40:01] Unknown:
1 of the big things, which I alluded to earlier a little bit, is templates. So we're taking in a lot of the requests that we're seeing in terms of what business users want to do across these different companies and really working on programmatically, you know, implementing models that can actually help save the data team time from having to build those, and also give business teams, hey, here's the 10 different wide tables that you would wanna see without requesting them from your data team, or, hey, here's some of the metrics that you would wanna see without having to spend half an hour to build them out. Right? Beyond going to Google to learn how to build a metric and compiling all the data sources in a sheet offline, there really isn't a great way to understand how, you know, metrics are built and how they're calculated and best practices for those. So we're really excited that we're kind of taking the learnings from across our customers and creating, you know, templates and models and packages that can be consumed, you know, pretty automatically.
So that's 1 of the big things that we're doing, and providing that to folks that don't necessarily have a warehouse as well. We're putting them on a managed data stack instead of having to turn away, you know, customers that don't have a warehouse. So those are 2 really big things that we're working on that we can't wait to get out there.
[00:41:09] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today? Biggest gap in my mind is really, I think, opening up that conversation and that collaboration between
[00:41:31] Unknown:
the creators of the data and the consumers of that data. I think for all the advances that we've made on the tooling side, understanding each other in a systematic way and collaborating across that divide is still, I think, really the biggest gap that I see, and something that we're, you know, most excited about with Canvas.
[00:41:49] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on Canvas. It's definitely a very interesting tool and product and an interesting approach to a very real problem. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Yeah. Thanks so much, Tobias. Have a great 1. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Ryan Buick and Canvas
The Genesis of Canvas
Challenges with the Modern Data Stack
The Spreadsheet Interface and Its Persistence
Scaling Spreadsheets: Technical and Organizational Challenges
Canvas's Role in the Data Ecosystem
Self-Service Data Exploration
Canvas Platform Architecture and Implementation
User Workflow in Canvas
Collaboration Between Data and Business Teams
Lessons Learned and Future Plans for Canvas