Summary
Every organization needs to be able to use data to answer questions about their business. The trouble is that the data is usually spread across a wide and shifting array of systems, from databases to dashboards. The other challenge is that even if you do find the information you are seeking, there might not be enough context available to determine how to use it or what it means. Castor is building a data discovery platform aimed at solving this problem, allowing you to search for and document details about everything from a database column to a business intelligence dashboard. In this episode CTO Amaury Dumoulin shares his perspective on the complexity of letting everyone in the company find answers to their questions and how Castor is designed to help.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
- Your host is Tobias Macey and today I’m interviewing Amaury Dumoulin about Castor, a managed platform for easy data cataloging and discovery
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Castor is and the story behind it?
- The market for data catalogues is nascent but growing fast. What are the broad categories for the different products and projects in the space?
- What do you see as the core features that are required to be competitive?
- In what ways has that changed in the past 1–2 years?
- What are the opportunities for innovation and differentiation in the data catalog/discovery ecosystem?
- How do you characterize your current position in the market?
- Who are the target users of Castor?
- Can you describe the technical architecture and implementation of the Castor platform?
- How have the goals and design changed since you first began working on it?
- Can you talk through the workflow of getting Castor set up in an organization and onboarding the users?
- What are the design elements and platform features that allow for serving the various roles and stakeholders in an organization?
- What are the organizational benefits that you have seen from users adopting Castor or other data discovery/catalog systems?
- What are the most interesting, innovative, or unexpected ways that you have seen Castor used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Castor?
- When is Castor the wrong choice?
- What do you have planned for the future of Castor?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever had to develop ad hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps platform that streamlines data access and security. Satori's DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift, and SQL Server, and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori, that's s-a-t-o-r-i, today and get a $5,000 credit for your next Satori subscription.
When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's l-i-n-o-d-e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn about all of the latest tools, patterns, and practices that power data projects across every domain.
Now there's a book that captures the foundational lessons and principles that underlie everything that you hear about here. I'm happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O'Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy. Your host is Tobias Macey. And today, I'm interviewing Amaury Dumoulin about Castor, a managed platform for easy data cataloging and discovery. So, Amaury, can you start by introducing yourself?
[00:02:23] Unknown:
I'm Amaury, the CTO at Castor. I joined this company 9 months ago, and I'm pretty much versed in a mix of data and software. That's pretty much my line of interest.
[00:02:36] Unknown:
And do you remember how you first got involved in the area of data?
[00:02:41] Unknown:
I got involved in data, like, 6 years ago by entering a market called services to consumer. It was a marketplace, pretty much like TaskRabbit. And then I understood that even if there was a job called data scientist, every company had a different understanding of what that meant. And in the end, I was doing more of a mix of software and data and helping the operations. It was thrilling, but I started from the back end, because that was the software background I was growing from. I kept on digging and ended up with more of the data analysis and data engineering topics, which was something still new for me 3 years ago.
[00:03:22] Unknown:
From that initial entry of getting exposed to working with data and digging into data engineering and data science, what is it that has led you to invest your time and energy into a company like Castor and helping with sort of the data discovery challenges?
[00:03:38] Unknown:
I can trace back why I joined Castor. Before that, I was at Qonto. Qonto is, like, a really big fintech in Europe and is growing super fast, with over 150,000 customers in the B2B space. And I was pretty much alone at the start doing the data, and I had to grow the team from 1 to 10 people with the growing needs and widening topics over time. And at one point, the CEO at Castor contacted me and presented me the product. And I was like, oh, I didn't even know there was such an area, because I was pretty much discovering it on my own. And so I helped them and we chatted a lot, and at some point, I decided to join them because they pitched it really well. And also, I had meetings with other competitors such as Atlan, and I really remember thinking, okay, my company is not really ready yet for this, but this is a really interesting area to be working on. And I always wanted to bridge data and software, so that was a unique opportunity.
So I joined them maybe 9 months ago, and we've been really growing strong on the product ever since.
[00:04:48] Unknown:
Can you dig a bit more into what it is that you're building at Castor and maybe some of the specific focus of the product that you're building there?
[00:04:58] Unknown:
If I have to put Castor in a nutshell, it mostly revolves around data assets. That would be the catalog, visualization, and tools such as dbt, plus more on the quality and engineering side. There are lots of them; I'm not gonna dig more into it. But once you state that your selling point is data assets, you need to collect and connect the sources automatically so that there's no burden for the customer. And then you need to find and connect the dots between those assets, because a visualization is connected to a table, to a column, and it needs to be easy. This is something happening behind the scenes, but this is the building block for us. And then our main interest is that we share the business and internal knowledge existing at a company. For instance, business terms, documentation, specifics about columns, links between the columns that are not really showing in the logs or in the construction of the warehouse, but are still really important for the users. So our main goal is to make it a smooth journey for new joiners in a company in terms of data, but also to make sure that knowledge is well spread and easily accessed.
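To make the "connect the dots between assets" idea concrete, here is a minimal sketch of the kind of asset graph a catalog tool might maintain, where a dashboard links back through a column to its table. The class and field names are hypothetical, invented for illustration, not Castor's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """A catalogable data asset: a table, a column, or a BI dashboard."""
    id: str
    kind: str          # "table" | "column" | "dashboard"
    name: str
    description: str = ""                          # business knowledge layered on top
    upstream: list = field(default_factory=list)   # ids of assets this one is built from

# A dashboard traced back to the column and table it reads from.
catalog = {
    "db.users": Asset("db.users", "table", "users"),
    "db.users.country": Asset("db.users.country", "column", "country",
                              upstream=["db.users"]),
    "dash.signups": Asset("dash.signups", "dashboard", "Signups by country",
                          upstream=["db.users.country"]),
}
```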
[00:06:13] Unknown:
As far as the overall ecosystem of data catalogs and data discovery tools, as you mentioned, it's a fast growing market. There's a lot of focus on it right now, but it's still somewhat nascent, where, as you mentioned before, you didn't really realize that this was even a category of offering available to you until you started engaging with the team at Castor. And so I'm wondering what you see as sort of the broad categories for data discovery and data catalog tooling in the ecosystem as it stands right now and some of the elements of differentiation that you see in the space?
[00:06:52] Unknown:
I think it's a really interesting question. I'll give my point of view, which, like I said, everyone has a different take on how the data roles and the data sectors are divided. There's a really interesting piece by Andreessen Horowitz on it also. So for me, I would see, first, the data engineering oriented solutions. Monte Carlo is maybe a solution where it focuses on quality and monitoring. The persona here is mostly the data engineer. Then there's more of the governance oriented, and we've seen behemoths such as Collibra that have emerged from it. And for me, those focus on fine-grained access and auditing, mostly happening in bigger companies with higher price points.
It's an interesting topic, but it's not for everyone in the company. It's mostly for the DPO and some other highly focused members of the data team. And the last part for me is the catalog oriented, and maybe I would cite Amundsen as the first example in terms of an open source solution. For me, it focuses on searching and adding documentation only. And to wrap up, I would say that at Castor we want to introduce data discovery tooling. This encompasses the catalog, which I just mentioned, but also brings a new family of features that support data consumers, which picks a little bit of the features from the different sectors I mentioned. But it's not meant to be the one-stop shop for data engineers such as Monte Carlo is. It's meant to be for data consumers.
So if I pick an example, you can perfectly say, okay, the head of marketing says, there's this table which tracks the revenue by hour or by day, depending on how specific it is or how intensive it is at your company, but I really wanna make sure that this table is well updated. This we will be able to bring. But it's more business oriented than just, okay, is my workflow sane? Is it working well? Such as a Monte Carlo tool would give. So I would say we're working in the same area, but there's a different take and a different way to look at things. So we're starting out from catalog, but we're trying to build up from that term and open it up to a wider meaning.
[00:09:10] Unknown:
And I think that your point of being focused on the data consumer as the primary target for the tool that you're building is an important thing to dig into. And you mentioned, for instance, Monte Carlo as being more focused on the data engineer and needing to be able to manage the data quality and health of the pipelines aspect. And so I'm wondering if you could maybe dig a bit more into some of the differentiation that that brings into the product space as far as being data consumer focused versus data producer focused and what you see as maybe being sort of the midpoint between those products as far as where they meet, where they integrate, and sort of how those different categories of focus fit in the broader ecosystem together.
[00:09:50] Unknown:
I mean, what is common? What is common is connecting to the sources and warehouses. For some of them, connecting to even more specific tooling, such as an Airflow. We're both connecting to dbt. So the source of truth, the source of information, may pretty much be the same, but the way to look at it and, eventually, the end product, the application that is shown to the consumer, differ. Maybe we will dig into architecture a bit later. Our take is that we are talking to data analysts, to data scientists, to business intelligence people, and also what we call data stewards, the people that are embedded in the company and have a bit of knowledge of SQL, but maybe just a bit. And what we try to bring those people is, like, a super efficient search engine so that they can find what they're looking for. This is not something you would typically find in a Monte Carlo approach.
What we also want to show is, for instance, the popularity, how frequently used a table or an asset is, and this is interesting. Also, the take on lineage, for instance. The way we show it is more to be able to see at a quick glance how it's constructed, rather than just explore and be able to debug if something goes wrong with a huge map of every asset that exists. This would be overwhelming, and we try to be not obtrusive, to keep it simpler. For instance, we have a set of automations, such as the column propagator, as we call it. Maybe we're not the best at naming things. But what I really love about that feature is that in some instances, we've seen admins or data leads, the people in charge of modeling, or analysts, be able to document a lot of columns really fast, because the modeling was nicely designed, so a lot of columns look the same or should have the same documentation. And we are covering this. It's geared towards the people documenting, so that we can leverage the time spent on it. So those are the kinds of tools: the UI, the features of automation, and the take on how to present the information so that it's easy for data analysts, and not as comprehensive as it would need to be for a data engineering tool. That doesn't mean we're hiding things, but we try to put foremost what is most impactful and interesting for a data consumer.
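As an illustration of the column propagator idea just described — fanning one vetted description out to same-named, still-undocumented columns — here is a minimal sketch; the metadata shape is invented for the example, not Castor's API.

```python
def propagate_description(columns, source_column, description):
    """Copy a vetted description to every column with the same name
    that is not yet documented. `columns` is a list of dicts with
    'table', 'name', and 'description' keys (illustrative schema)."""
    updated = []
    for col in columns:
        if col["name"] == source_column and not col["description"]:
            col["description"] = description
            updated.append(f'{col["table"]}.{col["name"]}')
    return updated  # let the admin review what was touched

cols = [
    {"table": "orders", "name": "user_id", "description": ""},
    {"table": "sessions", "name": "user_id", "description": ""},
    {"table": "users", "name": "user_id", "description": "Primary key of users"},
]
print(propagate_description(cols, "user_id", "Foreign key to users.id"))
# -> ['orders.user_id', 'sessions.user_id']  (the documented column is untouched)
```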
[00:12:18] Unknown:
And so in terms of the opportunities for innovation in the space, as you mentioned, there's sort of the core element of, this is the source of truth for the data that we're all working from, whether that comes from the dbt pipeline metadata or the data warehouse directly or some of the pipeline jobs. But there's definitely a lot of differences in terms of how you want to approach it and how you want to expose it, as we dug into as far as the data consumer versus data producer focus. And so what are the areas of innovation that you see and that you are working towards with Castor to try and help sort of drive the state of the art forward as far as how we're interacting with data, how we're thinking about how to organize it, how to collect it, and how to add useful and rich context to it?
[00:13:00] Unknown:
So maybe I can just start off with what's the baseline. The baseline is about fetching those assets, as we mentioned, and computing lineage and a bit of information, to make sure that we leverage every bit of time spent by the users, by the data analysts. Then, digging into your question about the opportunities for innovation, I would say that a lot of things were not possible a few years ago, even, like, 5 years ago, that's for sure. The exponential growth of Snowflake that we've been seeing is, for me, the tipping point of an overall transition to cloud warehouses.
This allows a transition to a software-as-a-service approach for cataloging, such as ours. Whereas before, a lot of companies were really restricting the access to the warehouse, or they had on-premise warehouses, which could not allow such tools. So it was pretty much about offering open source software that you could install on your machines or in your private, and sometimes public, cloud, but it was pretty restricted. For me, this was the entry point: the fact that now a lot of companies are totally fine with going with a cloud warehouse, be it Redshift, Snowflake, or BigQuery.
Then another thing for me, which is also driving innovation like crazy, is dbt, as you mentioned. I will not dig into what it does, because you presented it really well. Why is it, for me, the central piece of the recent evolution? Because it gives flexibility and autonomy to data or BI analysts. It may seem really like a simple thing to say, but it takes a lot of energy and a lot of focus off the data engineering teams and makes them able to tackle governance and documentation issues such as the ones Castor is trying to tackle. So for me, those were the recent enablers. We also are enjoying the fact that Amundsen pioneered, among a lot of other tooling that was developed in house and sometimes open sourced. They paved the way for a broader understanding of what a data catalog is. Pretty much the same way Airflow kind of led the way for what a data flow tool is. So this is the base. So where we're going from this is pretty interesting. We're trying to leverage those the same way the ELT tools leveraged them maybe 5 years ago, I'd say, when we saw the incredible growth of Fivetran or Stitch.
The tools we're building, such as, as I said, catalog or quality, will have the standpoint of connecting to those warehouses and sources, and we will build a lot of interesting features out of what we can infer, what we can compute, what additional information we can derive. I'll give you an example. We have, like, our own algorithm for computing popularity, but we can also go further when you have an editor: because we have all this information, we can power a much more powerful editor. I think Atlan has been trying to ship this kind of thing recently. But still, for me, there's a sort of a Chinese wall between metadata and data.
Whereas, if you allow deep read access to the data itself, which for the moment we don't ask for, because we wanna stay in the space where customers are really confident about the level of security we're providing, then you enter another layer of a lot of things you can assist your customers with. So I think a lot of people are gonna bring up interesting insights. I'd be really not surprised to see the same thing we saw with GitHub providing, you know, this bot that kind of autocompletes your code, that can type code for you. It could be the same for SQL, like pure knowledge. If you can see that some people always do the same kind of join and always the same kind of pattern, you can pretty much complete a lot of what they do.
Metabase also, which I didn't mention so far, has been a huge, incredible breakthrough in terms of self-service data. And this direction also can go much further, maybe in open source, but also with more proprietary software. So I'm pretty much speculating about the future there, but I think it's definitely super interesting what's happening.
[00:17:32] Unknown:
Digging more into Castor itself, can you talk through a bit of the overall feature capabilities that you're building into it and some of the technical architecture that is powering the platform that you're creating?
[00:17:42] Unknown:
As I said, we're targeting the data users: analysts, BI engineers, but also people who have access to data, depending on the company, which could be product managers or marketing people. So, in short, anyone who consumes and works with data can benefit from it. So that's the level of what we offer. And as I said, you can easily find your assets. You can easily add business knowledge. We will automatically improve and recompute any assets you might have. And we're bringing automation, as I said. We'll have a few anecdotes on it later, but some users can really have an epiphany in the fact that they can spend 5 minutes and document a lot of assets. And this was not that easy if you had to commit a lot of code in dbt docs, for instance, which is a great tool, but it's more of a source of truth than it is a collaboration tool. So for me, we're bringing collaboration to this space and also making sure we track all sources of documentation and reconcile them.
So that's about what we offer. In terms of the targets, I said it, and we might open up also a bit to the people doing the data modeling. Sometimes it's data engineers, sometimes it's another persona, so that's about it. We're really happy, and we have data engineers on the platform, but it's not meant to be a monitoring tool for them. So in terms of technical architecture and implementation, there are 2 big parts. The first part would be the extraction side. It pretty much resembles what an ETL does. It pulls or pushes assets from the customer sources, say the warehouse or the visualization tool. I say push or pull because sometimes we have logic that is given to the customer for him or her to run, and then he or she pushes. And sometimes we pull, depending on the level of access and the situations we're running in. So we are trying to develop a solution that makes our customers confident about the security while we still give them maximum value.
The second part is the web app. It pretty much resembles a lot of software-as-a-service products, serving the consumer with a great UI and a back end that is tailored for this app. This is also why we wanted to separate both: one is more data engineering, algorithm, and data flow oriented, and the other one resembles a lot of SaaS products in its architecture. I'm not saying it's off-the-shelf software, but it's pretty much more conventional. That's about the architecture. I'll be happy to dig into any more details you wanna ask about that.
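To picture the push flavor of the extraction side, here is a sketch of what customer-run logic might send to a catalog's ingestion endpoint. The URL, payload shape, and auth scheme are all hypothetical, invented purely for illustration.

```python
import json
import urllib.request

def push_metadata(api_url: str, token: str, assets: list) -> int:
    """Push extracted metadata (no row-level data) to a hypothetical
    catalog ingestion API, returning the HTTP status code."""
    payload = json.dumps({"assets": assets}).encode("utf-8")
    req = urllib.request.Request(
        api_url,
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Customer-side run: extract from the warehouse, then push only metadata.
assets = [{"kind": "table",
           "name": "analytics.public.users",
           "columns": ["id", "country", "created_at"]}]
# push_metadata("https://example.com/api/v1/ingest", "SECRET_TOKEN", assets)
```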
[00:20:36] Unknown:
And so in terms of the actual metadata extraction and doing the data discovery elements of it, I'm wondering if you can just talk through a bit more of how you handle sort of identifying what are the useful assets to pull out, doing any sort of entity resolution for being able to deduplicate records or data sources, and then some of the complexities that come in as far as doing the lineage calculation from the view that you have of the user's infrastructure.
[00:21:04] Unknown:
I remember the first time I was facing someone offering me, like, lineage computation out of the query logs. I was like, oh, how did you do this? And I started asking a lot of questions. And then I got the chance to do it myself, and it's not that easy. From the start, we acknowledged that it was not an exact science. The only exact science there could be is having the dbt file, for instance, because this is predictable. This is the source of truth. What we're trying to do is inferring. So we're trying to have a target of, say, like, I don't know, 70 to 80 percent accuracy. I'm just saying this as an indication, I'm not committing myself, just saying this is the kind of figure we're trying to look at. We have a base of tests of our own queries and also beta partners that allowed us to dig deeper into the queries to improve the product.
So how do we do this? The problem is that if you parse every query with the full grammar of SQL, it can be really slow, because some customers might have 100,000 queries a day, 200,000, even more. So it's really intense. And so we have, like, a cascade. What we have is first running through a faster layer that tries to infer what we're looking at, whether it's useful, and then a second layer that, within this and still in a fast way, extracts the source tables. So in a select, it would be the from and the join. And also, if there are CTEs or subqueries, it needs to be smart and figure those out. What can be tricky is that sometimes you've got aliases that are only temporary to the dbt run, or whatever other tool you use for your transformations.
And that can be tricky. So you also need to take into account that those could perfectly well only live for a few minutes and then not appear in the catalog. Sometimes the catalog has a lag also, so you would think you have the table, but it's not showing yet. So a lot of small things happening. But, overall, the real challenge is about the speed of your processing. Another nice quick win would be to try to deduplicate, as you said, those queries. So this is about the catalog. There are other types of issues, such as, for instance, something called sharded tables, where you have a root such as, I don't know, users, and then you have a suffix, which would be the date. And internally, some warehouses are able to do the stitching so that when you query accounts or users, it seems like one table, but it's like partitions in Postgres.
Behind the scenes, there are a lot of tables, but it doesn't show like this in the queries or in the catalog. So sometimes some power user features that are really useful in the warehouses turn out to be a lot of trouble for us in this field of inferring information. So we are improving, and we're trying to be as efficient across all of the warehouse technologies that we're covering right now. But, frankly, it's still ongoing. There's a fat tail of improvements which we will, I think, never completely master, but we're trying to go after the biggest chunk every time, so that we can address our largest consumer base and also keep our current users pleased and delighted with what we do.
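As a sketch of the two-layer cascade described here — a cheap prefilter, then real parsing to pull out source tables while discarding CTE aliases — this is roughly what it could look like using the open source sqlglot parser. It illustrates the technique, not Castor's actual implementation, and a real system would also need the alias and sharded-table handling mentioned above.

```python
import re
import sqlglot
from sqlglot import exp

# Layer 1: a fast regex filter keeps only queries that build assets.
WRITE_PATTERN = re.compile(r"^\s*(create|insert|merge)\b", re.IGNORECASE)

def source_tables(query: str, dialect: str = "snowflake") -> set:
    """Layer 2: a full parse extracts table references, minus CTE aliases."""
    if not WRITE_PATTERN.match(query):
        return set()
    tree = sqlglot.parse_one(query, read=dialect)
    cte_aliases = {cte.alias for cte in tree.find_all(exp.CTE)}
    return {
        table.name
        for table in tree.find_all(exp.Table)
        if table.name not in cte_aliases
    }

q = """
CREATE TABLE mart.daily_revenue AS
WITH recent AS (SELECT * FROM raw.orders)
SELECT r.day, u.country FROM recent r JOIN raw.users u ON r.user_id = u.id
"""
print(source_tables(q))
# {'daily_revenue', 'orders', 'users'} — the target plus its real sources,
# with the temporary CTE alias 'recent' correctly filtered out.
```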
[00:24:31] Unknown:
And then in terms of the actual architectural aspects of how you're thinking about the design of the system to make it extensible and being able to support the customer infrastructure, I'm wondering, first, what are the data infrastructure components that you're targeting out of the box, as far as whether a customer needs to have a data warehouse set up in order to be able to really gain a lot of value from Castor? And what are some of the extension points for being able to plug in other sources of data or other metadata information to be able to enrich the context that Castor is able to provide as people are doing discovery and management of the different assets within the organization?
[00:25:15] Unknown:
As you said, if you're a company that doesn't yet have a warehouse, I don't think Castor is in its best spot, because you won't leverage a lot of what is possible. For instance, the query logs don't exist in traditional relational databases such as MySQL or Postgres. You have some form of logs, but you don't have the extensive or comprehensive logs you would have in warehouses, which are designed for retaining those. So lineage will not be easily computed. But you can tell me you have foreign keys and so on and so forth. So if you're doing analytics in a Postgres database, which I did at some point, so I'm not gonna shame anyone here.
But it's not for the best, because I think you have bigger fish to fry, because you have a lot of other things in your data journey. But then again, we have a few customers with it, and we're serving them. It just means that it's only focusing on the catalog, and we're so much more. So that's about it. And then in terms of other sources we could connect to, as you mentioned: for us, we want to connect to everything that is an easily reached artifact. The dbt manifest is one, but also connecting to your Tableau, Looker, or Metabase instance, which covers more than 80% of the market with those, I mean, with the people on cloud warehouses. I'm not saying the full market, because this is kind of biased, but I explained earlier why we were in this field. So connecting to those, I think, is pretty efficient, and we're trying to push this through, and it's kind of easy once the security concerns are out of the way. But we've seen a pattern of a lot of companies more willing to push rather than to have the data pulled. So we need to develop tooling.
Be it libraries, be it open source code, in the same way as Airbyte in the ELT space, we will have, at some point, to provide libraries that are able to push, and maybe a bigger API. Those are the kinds of spaces we need to dig into if we want to make this integration smoother. For the moment, the one which is smooth is the warehouse. But for the other tools, we have smart approaches to go quick, such as an API with a more generic visualization interface, but we wanna be more specific than generic, because in Tableau, there are interesting assets such as explores, which you don't have in Metabase. But in Metabase, you have a source code which can be embedded and a different kind of queries. And in Looker, you've got LookML, which is another layer of interesting sources of information. So working our way through the specifics of the visualization tools is also kind of tricky, but we are trying to go through the maze.
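For the pull side against a visualization tool, here is a sketch of fetching dashboard metadata from a Metabase instance over its REST API. Metabase does expose a session endpoint and a dashboard listing, but treat the exact fields shown here as assumptions to verify against the version you run.

```python
import requests

def fetch_metabase_dashboards(base_url: str, username: str, password: str) -> list:
    """Authenticate against a Metabase instance and list its dashboards,
    the starting point for cataloging BI assets."""
    session = requests.post(
        f"{base_url}/api/session",
        json={"username": username, "password": password},
        timeout=30,
    )
    session.raise_for_status()
    token = session.json()["id"]

    resp = requests.get(
        f"{base_url}/api/dashboard",
        headers={"X-Metabase-Session": token},
        timeout=30,
    )
    resp.raise_for_status()
    # Keep only the fields a catalog would index: id, name, description.
    return [{"id": d["id"], "name": d["name"], "description": d.get("description")}
            for d in resp.json()]

# dashboards = fetch_metabase_dashboards("https://metabase.example.com",
#                                        "svc@example.com", "SECRET")
```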
[00:28:12] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. And I think it's worth talking a bit more about the treatment of visualizations and dashboards as assets within the data ecosystem of a company. Because most of the time, when you're thinking about data discovery and data cataloging, people automatically assume the tables and records in the data warehouse, or the files that are in, you know, Parquet on S3 for their data lake. And so I'm wondering if you can talk through some of the benefits of using the visualizations as a category of asset to track and some of the ways that that's used within Castor for being able to collaborate across the organization as a data consumer?
[00:29:27] Unknown:
Well, some data consumers can only access the visualization layer. They don't even have access to the warehouse layer. So, I mean, this is a first statement that can already kind of pin down the fact that if you only limit yourself to the warehouse, depending on the architecture and the organization of the company, you may very well only reach 5 or 10 people, which is useful but doesn't bring that collaboration and knowledge base that we want to bring. So, typically, what I have in mind, I can speak about Qonto, for instance, remembering how it worked. You would have, like, a Tableau instance, which is more for reporting for the stakeholders, and you would have a Metabase instance, which is more for self-service.
Both serve a different purpose. Metabase is less organized, with a lot of queries, and this is also a different and very interesting topic about managing the kind of explosion of wild west things that can happen in your self-service option. But yet, there are a couple, or even, like, dozens, or, I don't know, 50 dashboards which are highly valued, and those are kind of supported by the data team. And because they're supported, we really want to have them in our product, because we want to track them. We want to make sure they're surfaced as highly popular so that people know. Oh, okay, I wanna know how many users there are in our product, how many consumers in our B2C product. So I will type users, and then the first thing that will come up is maybe a table called users and a few others, less popular, but definitely those dashboards, those visualizations. They are super useful, because this is usually where business and data converge. It's mainly on this definition of the queries behind and the way to present data. So I would say the data modeling and the queries that are the base, in dbt, of how your transformations happen are super impactful. But in the end, the users in the business, be it finance, be it marketing, be it product or operations, they're operating at the visualization layer. So for us, because we want to reach out to these data stewards, these data practitioners in those areas, we wanna make sure we have these visualizations. But, yeah, I think this is why we're interested. It's not so much, you know, a go-to-market thing of, oh, no one is in this spot, or not a lot of people, we should go there. It's more that we wanna make sure all the people that are touching data and consuming data, even indirectly through the BI tools, can access it and the knowledge behind it. And what can happen is that once they stumble upon this Tableau dashboard, they can poke the data analyst or even the data engineer saying, oh, this is the table behind it, because we have lineage.
I'd be interested to learn a bit more. Maybe we can add a new column, because we always need this, and I think it'd be valuable to have it in the source of truth. Maybe I'm pushing a bit far in a direction that might never exist, but definitely they're able to, you know, just open the door and see what's inside the house and not just be shut out of it. And that's also interesting.
[00:32:51] Unknown:
Yeah. I definitely think that treating the visualization as a first class concern within the platform is useful, because, as you pointed out, the consumer of the data isn't necessarily going to be able to, or even want to, understand the table level metadata. They just see, this is a report that shows me what I need to know, or, you know, I need to know, do we have a report that answers this? And if not, then I need to ask for it. And then on the data analyst and data engineer side, if you do have a business user coming in saying, I need to be able to answer this question, then you can say, especially if it's a large organization, well, we already have this report over here. Is this what you want? If not, here's how we can modify it, kind of a thing. And so I think that's definitely a great way to sort of open up that collaboration beyond just the data professionals to everybody within the business.
[00:33:41] Unknown:
Totally agree. And, also, I'll jump back on what you just said about duplication. It could also help in seeing if anyone has already answered, or tried to answer, this, because sometimes, in Metabase, for instance, the search tool is crap. I mean, I love their product, but this is definitely a crappy part of it. And sometimes, also, people did things in their private space, and they could share it. And it could also be an opportunity, because you have access to what they did on their side, which is not public, to say, okay, maybe this one is also valuable. So less reinventing the wheel. I mean, there will always be a bit of this, because the queries are never exactly the same. But you can limit it, with exploration, with people saying, this is good enough. I'm not gonna ask the data team. I can work my way with this.
It's also a big plus for, you know, the overwhelmed data teams with a lot of ticketing. I'm sure you've stumbled upon this kind of organization, and we wanna support them too. Absolutely.
[00:34:44] Unknown:
And then, talking through the overall process of getting Castor set up within an organization and onboarding people to start leveraging it to do this discovery and collaboration around their data assets, can you just talk through some of the steps that are involved and some of the user education that you've had to do to help people understand the value and benefit and how to get the most capabilities out of the product?
[00:35:09] Unknown:
Sure. I mean, there's a lot of explanation and talking through even before we get to the real onboarding process, because of the sales cycles, which are super slow compared to, I mean, more off-the-shelf software such as, I don't know, Sentry or GitHub. Because the onboarding process is more complicated, we need to educate more and make sure that we are aligned with the customer. But that happens more in the sales part. Then, in terms of the actual process of adding a warehouse, which is the meaty part and the starting point for all of it: the customer opens an account, and then the admin does this. The admin will walk through a quick warehouse setup, and we have nice guides for them.
It gives us access, and we happily explain to them why we need each role that we ask for, whatever warehouse we're talking about. And then they add the warehouse themselves in the product. And if they have a visualization tool, they will also do it. And then they start using it. Everything else is taken care of by us. What I can also stress, more in terms of life cycle and added value, is that we have someone doing product ops. It's more about helping them bridge the gap of: I also want this, but it's not really a feature that we're gonna design now, so how can we squeeze it into some sort of existing feature so that you can have it? For instance, I have these Google Docs with documentation.
Could you load them for us? So we have this kind of support on the side that helps them make it easier and better with semi-automated actions. We totally understand that a lot of situations are different, that maybe they had initiatives or documentation before, and they wanna make sure that this also makes it into Castor. And, also, we wanna make sure that if we don't cover all of their use cases, and I don't think we can, but we're trying to, we can sometimes find a way to squeeze in more, without a lot of specific development, but with smart manual actions.
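On the "why we need each role" point, here is a sketch of the kind of scoped, metadata-oriented access a setup guide might walk a warehouse admin through, assuming Snowflake. The role and object names are hypothetical, and the exact privileges any given tool needs will differ.

```python
# Hypothetical read-only setup a Snowflake admin might run for a catalog
# crawler: visibility on metadata and query history, not on row data.
SETUP_STATEMENTS = [
    "CREATE ROLE IF NOT EXISTS catalog_reader;",
    "GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE catalog_reader;",
    "GRANT USAGE ON DATABASE analytics TO ROLE catalog_reader;",
    "GRANT USAGE ON ALL SCHEMAS IN DATABASE analytics TO ROLE catalog_reader;",
    # REFERENCES exposes table and column definitions without granting SELECT:
    "GRANT REFERENCES ON ALL TABLES IN DATABASE analytics TO ROLE catalog_reader;",
    # Query history (for lineage and popularity) lives in the shared database:
    "GRANT IMPORTED PRIVILEGES ON DATABASE snowflake TO ROLE catalog_reader;",
]

def run_setup(cursor) -> None:
    """Execute the grants with any DB-API cursor (e.g. snowflake-connector)."""
    for stmt in SETUP_STATEMENTS:
        cursor.execute(stmt)
```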
[00:37:22] Unknown:
In terms of the product design, I'm wondering what have been some of the complexities that you've had to tackle to make it accessible and understandable to all the different roles that have to interact with it in different contexts, and just being able to bridge those differences in terms of experience and perspective and create a cohesive experience for everybody in the organization.
[00:37:46] Unknown:
It's definitely something hard. I can remember, from a demo from Atlan, for instance, that I was amazed by a lot of things in the product, but, frankly, I had issues finding my way, even on simple screenshots, of what this was about. So this is why, building on what we've seen before, we try to really have a smooth and intuitive UI. And we have feedback on it, which is really positive too. So maybe I can dig into the 2 personas. There's the admin and there's the regular user. The admin can oversee who is the owner of an asset. He or she can also validate, and can operate the description propagation I mentioned earlier: okay, you've got opportunities to propagate this description to other columns of the same name, and this admin can do it.
We wanted to make sure that only an admin could do this, because it's super impactful, super useful, but should only be given to people who are capable of having this responsibility. But what is available for all users is the fast search, because this is what we're trying to bring them: to find their way. So we want to allow them to find their way, so everything is easily searchable. The lineage, for instance: we have a way to allow logical exploration. You're in an asset, you can see its parents, its children, and work your way around. We didn't want to have, like, a Neo4j approach with everything shouted back at you. We wanted something more easily operable.
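A sketch of that logical exploration idea: given a lineage edge list, surface only an asset's direct parents and children instead of rendering the whole graph. The edge representation is illustrative.

```python
def neighbors(edges, asset_id):
    """One hop of lineage: the direct parents and children of an asset.
    `edges` is an iterable of (parent, child) pairs."""
    parents = sorted({p for p, c in edges if c == asset_id})
    children = sorted({c for p, c in edges if p == asset_id})
    return {"parents": parents, "children": children}

edges = [("raw.orders", "mart.daily_revenue"),
         ("raw.users", "mart.daily_revenue"),
         ("mart.daily_revenue", "dash.revenue")]
print(neighbors(edges, "mart.daily_revenue"))
# {'parents': ['raw.orders', 'raw.users'], 'children': ['dash.revenue']}
```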
Another thing I can mention which really helps is the popularity. I've mentioned it several times. I think it's a fundamental element, because it helps you immediately differentiate between something which is poorly used, or frankly not used at all, and something which is really a core table. I mean, it's pretty much the same as going on IMDb and checking whether a movie is popular or not. It's a fast way to do it. And I think this has proven useful. We've had feedback from people saying, oh, but this table is useful, and I don't see it in the top popular.
So we tweaked the algorithms, and we try to improve them so that they're meaningful to people. And, frankly, we have good feedback on it. So that's one thing. Maybe one last thing I can mention is the way we present information. For instance, if you go on a table asset page, we try to give you what's most useful first: the columns with their descriptions and names. And then we have more of a side panel with additional information you can hide or show. Because there are so many things we can infer, so much information that is available, but we don't wanna make it look like, you know, a Christmas shop with a lot of things glittering and just shouting too loud. This is also a take. We might be wrong, but we take the stance of trying to bring forward what we believe data users value the most.
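Castor's actual popularity algorithm isn't public, so purely as a sketch of the general idea: a recency-weighted score over query-log entries, where fresh queries and a broad set of distinct users push a table up the ranking while stale-but-once-popular tables sink.

```python
import math
from datetime import datetime, timedelta

def popularity(query_log, now=None, half_life_days=30):
    """Score tables from query-log entries of the form
    (table, user, queried_at), decaying each query's weight with age."""
    now = now or datetime.utcnow()
    scores, users = {}, {}
    for table, user, queried_at in query_log:
        age_days = (now - queried_at).days
        weight = math.exp(-math.log(2) * age_days / half_life_days)
        scores[table] = scores.get(table, 0.0) + weight
        users.setdefault(table, set()).add(user)
    # Breadth of usage matters too: scale by the distinct-user count.
    return sorted(((t, s * len(users[t])) for t, s in scores.items()),
                  key=lambda pair: -pair[1])

log = [("users", "ana", datetime.utcnow() - timedelta(days=1)),
       ("users", "bob", datetime.utcnow() - timedelta(days=2)),
       ("tmp_scratch", "ana", datetime.utcnow() - timedelta(days=200))]
print(popularity(log))  # 'users' ranks far above 'tmp_scratch'
```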
[00:40:52] Unknown:
And then in terms of the usage of Castor and the customers that you've worked with, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:41:03] Unknown:
I wanted to say, it works, and that's unexpected. But apart from that little pun, I think that we've seen interesting things, such as, you know, power documenters. Yes, it's a thing. They turn it into a game, where they challenge themselves whenever someone in the company tells them, oh, this documentation could be improved. And they tease each other saying, okay, mine are pristine; yours, there's some internal chat about it. So apart from the bad talking, the little teases, I think the gamification of it wasn't intended. The fact is that, in the end, documenting is not thrilling. I mean, it's not the sexiest topic of all. But on the other side, consuming documentation is really useful. I could mention, for me, the best example of it: Stripe, as one of the best developer documentations in existence.
I can't imagine writing it would be thrilling for the people at that company, but they know that it's super, super useful in the end. And I think the same thing happens for these power users. They're really proud of the way people use it. Another iteration of funny things would be, we've seen a user, or several of them, documenting 500 columns in less than 10 minutes, thanks to this propagation tool. So in the end, they reached, I don't know, 10 or 20 percent documentation coverage across their whole warehouse in less than 10 minutes. Maybe because they had done it right in the first place, but also because the tool allowed it.
So that was not expected. We would have thought maybe it was gonna get them 10 or 20 columns, and 500 was definitely a high figure.
[00:42:40] Unknown:
And then in terms of your experience of working with Castor and helping to build out the product and using it to sort of do whatever testing you need to do, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:42:55] Unknown:
I think, to keep our feet on the ground, we really learned about the wide span of ETL, dbt, and warehouse usage across our customers. The same way I stated earlier that someone with the role of data engineer or data analyst could do a lot of different things depending on the organization and the needs of his or her company, the same happens in the way our customers set up and work their data flows. So, I mean, there are good practices, we've seen them, but there's no real standard. And what is challenging for us is making sure that all our users can leverage Castor the same way. So to make sure that, whatever the way they design their architecture, we can still take the information out, align it with the way we work, and feed it back into the product so that it's useful. And as I said, this in terms of discovery and in terms of collaboration.
So clearly, that's for me the biggest lesson: there are a lot of different tools and a lot of ways to set them up. For instance, maybe to give an example, you can have dbt in dbt Cloud or in your own infrastructure. And in your infrastructure, you can perfectly set it as one task in your Airflow or try to split it per model. You can perfectly have a lot of different staging environments, or not. You can activate the tests, or not. I mean, I'm not gonna spend the next 10 minutes trying to pinpoint every little detail about it, but I think you can see where I'm going: even if it's a widely used tool with a lot of common base, every company can tweak it in a different way. And it's a bit challenging, because some of the tweaks can kind of impair our ability to see through the catalog, the links, and so on.
[00:44:50] Unknown:
For people who are interested in improving the collaboration and discovery of data within their organization, what are the cases where Castor might be the wrong choice?
[00:44:56] Unknown:
As I said earlier, if you're really early in your data journey. For instance, you don't yet have an ELT, you still have ETL, or, I did this at one point a few years ago, but I have confessed my sins since: what I had done was transforming from one Postgres table in memory to another one. Well, okay, not the best option. It worked at some point, until it didn't. And this is when I chose to move to a warehouse and an ELT approach. Those are natural choices, for me, before you even try to dig into a discovery or catalog tool. Yet I don't think it's a no-go. I'm just saying that you will need to evolve to this warehouse at one point, and you will need to have this ELT in place to support the mental complexity of understanding what you have in hand.
We didn't have that early on at Qonto. We did several improvements to allow it, and I think that it's gonna be a burden, or at least it's gonna take some time that, for me, you would surely be better off devoting to benchmarking and setting up those tools. Yet, if you already have a lot of, for instance, self-service data, even if you're, like, a 50 person company, or 25, and almost everyone in the company does SQL, it could already be interesting to have Castor because of this. But in most cases, if you don't yet have what I said, the ELT and the warehouse, I think it's not gonna be the best choice, or it's gonna be too soon, maybe.
[00:46:29] Unknown:
In terms of the ongoing work that you are doing with Castor, what are some of the near to medium term plans that you have for it and some of the areas for exploration that you're particularly excited about?
[00:46:40] Unknown:
It might not shine like a glowing sun, but we're trying to bring more integrations to satisfy more customers. This might seem like something really basic, but what we have in the current plan is bringing more integrations and also more depth. I mentioned Tableau. With Tableau, you can do it easy with only the dashboards. You can add another layer with what we call tiles, which would be, like, the questions behind the dashboard, and you can even go deeper with what is called explores, mentioned earlier. So I think depth, along with integrations, definitely has a large footprint in the roadmap. That's important for us. And we see the same thing happening with Airbyte, for instance, in the sense that customers are expecting this. For instance, if they shift to a new solution, they wanna have it in Castor, because they were already customers, and they will not understand why it is not there. So that's one part. Another really interesting part, which I love because I really love algorithms, would be more automation features, like I said, and also maybe a bit more quality oriented. But then again, we're not Monte Carlo; it's more in the direction of empowering data users than data engineers, to make sure that they have the level of information that they want. For instance, we were going in the direction of what is called tracking tables, to make sure that for the tables you're most interested in, or the tables that are the most popular, whatever the reason is, we inform you of their quality and whether they are up to date or not. And this is the kind of automation we have in mind.
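The tracking-tables idea, sketched in code: compare each watched table's last load time against an expected cadence and flag the stale ones so their consumers can be notified. The table names and thresholds are illustrative.

```python
from datetime import datetime, timedelta

# Hypothetical watchlist: table -> maximum acceptable staleness.
TRACKED = {
    "mart.daily_revenue": timedelta(hours=26),
    "mart.signups": timedelta(days=2),
}

def stale_tables(last_updated, now=None):
    """Return the tracked tables whose last update exceeds the allowed lag.
    `last_updated` maps table -> datetime of the most recent load."""
    now = now or datetime.utcnow()
    return [t for t, max_lag in TRACKED.items()
            if now - last_updated.get(t, datetime.min) > max_lag]

status = {"mart.daily_revenue": datetime.utcnow() - timedelta(days=3),
          "mart.signups": datetime.utcnow() - timedelta(hours=5)}
print(stale_tables(status))  # ['mart.daily_revenue'] -> notify its consumers
```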
[00:48:13] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:36] Unknown:
As a European, and, frankly, the same thing is happening at a lower scale with the California law, I think we're lacking easy to use data governance tooling to address GDPR. I mean, GDPR is, like, the biggest nightmare slash joke. I don't wanna be, like, rude to anyone saying this among heads of data, because they know they have to do it. But even if they wanna, you know, play by the book and also be respectful to their consumers, it's not adding a lot of value, but it demands a lot of time, and there's no playbook. For instance, even if you wanna do it the best way you can think of, the way you interpret the law, or even the pieces of information, like I said, that a consultant might have given you.
You're still, like, a bit in the dark. You're not sure. And in the end, if you get, like, the administration that pays you a visit and audits it, and that happened to Alan last year, and they have a nice Medium article on it, then even if you thought you had done it right, they're still gonna ask you to improve a lot of things. They're gonna take a lot of your time to make sure you're compliant with the way they perceive things. So in the end, trying to have this, like, tool, or a help tool, a help-and-consulting thing, in terms of GDPR, that's definitely an area which is not very sexy, but would bring a lot of value and a lot of peace of mind to a lot of people around the globe, specifically in Europe.
So that's one thing I think is a big gap. Also, self-service data is super powerful, but we've seen that it's really challenging in terms of governance, because there are copycats of queries, there are redefinitions of KPI terms, BI definitions such as ARR. For instance, how much money am I making? What's the revenue? If you're not paying enough attention, with self-service data, 4 different teams can reinvent their own definitions of this business term, and that's super dangerous. Self-service allows a lot of people to monitor the activity, to make sure they have answers to their own issues. So it's a difficult trade-off, and definitely the area of deciding who can access this table, because this one is, you know, consolidated or validated, and which area is more, not rubbish, but less controlled.
This is something that Metabase has been trying to do, but, frankly, the matrix of rights is a nightmare. I've sat through kind of an experiment to improve it, and it's difficult. So that's another layer, which is about
[00:51:16] Unknown:
enabling self-service data without having too much of a mess. So, yeah, GDPR and self-service data are definitely super opportunities in the near future for me. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Castor. It's definitely a very interesting product and an interesting problem area, so I'm definitely excited to see a lot of the activity that's happening there. I appreciate all the time and energy you're putting into trying to help provide solutions to that space. So thank you again for your time, and I hope you enjoy the rest of your day. Thanks a lot, Tobias.
[00:51:49] Unknown:
Thanks also for all the energy you put into running this podcast, and congrats on your book.
[00:52:00] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Amaury Dumoulin
Journey into Data Engineering
Building Castor: Motivation and Vision
Data Cataloging and Discovery
Data Consumer vs. Data Producer Focus
Opportunities for Innovation in Data Management
Castor's Features and Technical Architecture
Metadata Extraction and Lineage Calculation
Visualization as Data Assets
Onboarding and User Education
Product Design and User Experience
Unexpected Uses and Lessons Learned
When Castor Might Not Be the Right Choice
Future Plans and Areas of Exploration
Biggest Gaps in Data Management Tooling
Closing Remarks