Summary
Every organization needs to be able to use data to answer questions about their business. The trouble is that the data is usually spread across a wide and shifting array of systems, from databases to dashboards. The other challenge is that even if you do find the information you are seeking, there might not be enough context available to determine how to use it or what it means. Castor is building a data discovery platform aimed at solving this problem, allowing you to search for and document details about everything from a database column to a business intelligence dashboard. In this episode CTO Amaury Dumoulin shares his perspective on the complexity of letting everyone in the company find answers to their questions and how Castor is designed to help.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
- Your host is Tobias Macey and today I’m interviewing Amaury Dumoulin about Castor, a managed platform for easy data cataloging and discovery
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Castor is and the story behind it?
- The market for data catalogues is nascent but growing fast. What are the broad categories for the different products and projects in the space?
- What do you see as the core features that are required to be competitive?
- In what ways has that changed in the past 1–2 years?
- What are the opportunities for innovation and differentiation in the data catalog/discovery ecosystem?
- How do you characterize your current position in the market?
- Who are the target users of Castor?
- Can you describe the technical architecture and implementation of the Castor platform?
- How have the goals and design changed since you first began working on it?
- Can you talk through the workflow of getting Castor set up in an organization and onboarding the users?
- What are the design elements and platform features that allow for serving the various roles and stakeholders in an organization?
- What are the organizational benefits that you have seen from users adopting Castor or other data discovery/catalog systems?
- What are the most interesting, innovative, or unexpected ways that you have seen Castor used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Castor?
- When is Castor the wrong choice?
- What do you have planned for the future of Castor?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever had to develop ad hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps platform that streamlines data access and security. Satori's DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift, and SQL Server, and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori, that's s-a-t-o-r-i, today and get a $5,000 credit for your next Satori subscription.
When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's l-i-n-o-d-e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn about all of the latest tools, patterns, and practices that power data projects across every domain.
Now there's a book that captures the foundational lessons and principles that underlie everything that you hear about here. I'm happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O'Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy. Your host is Tobias Macey. And today, I'm interviewing Amaury Dumoulin about Castor, a managed platform for easy data cataloging and discovery. So, Amaury, can you start by introducing yourself?
[00:02:23] Unknown:
I'm Amaury, the CTO at Castor. I joined this company 9 months ago, and I'm pretty much versed in a mix of data and software. That's pretty much my line of interest.
[00:02:36] Unknown:
And do you remember how you first got involved in the area of data?
[00:02:41] Unknown:
I got involved in data, like, 6 years ago by entering a market called services to consumer. It was a marketplace, pretty much like TaskRabbit. And then I understood that even if there was a job called data scientist, every company had a different understanding of what that meant. And in the end, I was doing more of a mix of software and data and helping the operations. It was thrilling, but I started from the back end, because that was the software background I was growing from. I kept on digging and ended up with more of the data analysis and data engineering topics, which was something still new for me 3 years ago.
[00:03:22] Unknown:
From that initial entry of getting exposed to working with data and digging into data engineering and data science, what is it that has led you to invest your time and energy into a company like Castor and helping with sort of the data discovery challenges?
[00:03:38] Unknown:
I can trace back why I joined Castor. Before that, I was at Qonto. Qonto is, like, a really big fintech in Europe and is growing super fast, with over 150,000 customers in the B2B space. And I was pretty much alone at the start doing the data, and I had to grow the team from 1 to 10 people with the growing needs and widening topics over time. And at one point, the CEO at Castor contacted me and presented me the product. And I was like, oh, I didn't even know there was such an area, because I was pretty much discovering it on my own. And so I helped them and we chatted a lot, and at some point, I decided to join them because they pitched it really well. And also, I had meetings with other competitors such as Atlan, and I really remember thinking, okay, my company is not really ready yet for this, but this is a really interesting area to be working on. And I always wanted to bridge data and software, so that was a unique opportunity.
So I joined them maybe 9 months ago, and we've been really growing strong on the product ever since.
[00:04:48] Unknown:
Can you dig a bit more into what it is that you're building at Castor and maybe some of the specific focus of the product that you're building there?
[00:04:58] Unknown:
If I have to put Castor in a nutshell, it mostly revolves around data assets. That would be the catalog, visualization, and tools such as dbt, plus more on the quality and engineering side. There are lots of them; I'm not gonna dig more into it. But once you state that your selling point is data assets, you need to collect and connect the sources automatically so that there's no burden for the customer. And then you need to find and connect the dots between those assets, because a visualization is connected to a table, to a column, and it needs to be easy. This is something happening behind the scenes, but this is the building block for us. And then our main interest is that we share the business and internal knowledge existing at a company. For instance, business terms, documentation, specifics about columns, links between the columns that are not really showing in the logs or in the construction of the warehouse, but are still really important for the users. So our main goal is to make it a smooth journey for new joiners in a company in terms of data, but also to make sure that knowledge is well spread and easily accessed.
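To make the "connect the dots between assets" idea concrete, here is a minimal sketch of the kind of asset graph a catalog tool might maintain, where a dashboard links back through a column to its table. The class and field names are hypothetical, invented for illustration, not Castor's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """A catalogable data asset: a table, a column, or a BI dashboard."""
    id: str
    kind: str          # "table" | "column" | "dashboard"
    name: str
    description: str = ""                          # business knowledge layered on top
    upstream: list = field(default_factory=list)   # ids of assets this one is built from

# A dashboard traced back to the column and table it reads from.
catalog = {
    "db.users": Asset("db.users", "table", "users"),
    "db.users.country": Asset("db.users.country", "column", "country",
                              upstream=["db.users"]),
    "dash.signups": Asset("dash.signups", "dashboard", "Signups by country",
                          upstream=["db.users.country"]),
}
```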
[00:06:13] Unknown:
As far as the overall ecosystem of data catalogs and data discovery tools, as you mentioned, it's a fast growing market. There's a lot of focus on it right now, but it's still somewhat nascent, where, as you mentioned before, you didn't really realize that this was even a category of offering available to you until you started engaging with the team at Castor. And so I'm wondering what you see as sort of the broad categories for data discovery and data catalog tooling in the ecosystem as it stands right now and some of the elements of differentiation that you see in the space?
[00:06:52] Unknown:
I think it's a really interesting question. I'll give my point of view, which, like I said, everyone has a different take on how the data roles and the data sectors are divided. There's a really interesting piece by Andreessen Horowitz on it also. So for me, I would see, first, the data engineering oriented solutions. Monte Carlo is maybe a solution where it focuses on quality and monitoring. The persona here is mostly the data engineer. Then there's more of the governance oriented, and we've seen behemoths such as Collibra that have emerged from it. And for me, those focus on fine-grained access and auditing, mostly happening in bigger companies with higher price points.
It's an interesting topic, but it's not for everyone in the company. It's mostly for the DPO and some other highly focused members of the data team. And the last part for me is the catalog oriented, and maybe I would cite Amundsen as the first example in terms of an open source solution. For me, it focuses on searching and adding documentation only. And to wrap up, I would say that at Castor we want to introduce data discovery tooling. This encompasses the catalog, which I just mentioned, but also brings a new family of features that support data consumers, which picks a little bit of the features from the different sectors I mentioned. But it's not meant to be the one-stop shop for data engineers such as Monte Carlo is. It's meant to be for data consumers.
So if I pick an example, you can perfectly say, okay, the head of marketing says, there's this table which tracks the revenue by hour or by day, depending on how specific it is or how intensive it is at your company, but I really wanna make sure that this table is well updated. This we will be able to bring. But it's more business oriented than just, okay, is my workflow sane? Is it working well? Such as a Monte Carlo tool would give. So I would say we're working in the same area, but there's a different take and a different way to look at things. So we're starting out from catalog, but we're trying to build up from that term and open it up to a wider meaning.
[00:09:10] Unknown:
And I think that your point of being focused on the data consumer as the primary target for the tool that you're building is an important thing to dig into. And you mentioned, for instance, Monte Carlo as being more focused on the data engineer and needing to be able to manage the data quality and health of the pipelines aspect. And so I'm wondering if you could maybe dig a bit more into some of the differentiation that that brings into the product space as far as being data consumer focused versus data producer focused and what you see as maybe being sort of the midpoint between those products as far as where they meet, where they integrate, and sort of how those different categories of focus fit in the broader ecosystem together.
[00:09:50] Unknown:
I mean, what is common? What is common is connecting to the sources and warehouses. For some of them, connecting to even more specific tooling, such as an Airflow. We're both connecting to dbt. So the source of truth, the source of information, may pretty much be the same, but the way to look at it and, eventually, the end product, the application that is shown to the consumer, differ. Maybe we will dig into architecture a bit later. Our take is that we are talking to data analysts, to data scientists, to business intelligence people, and also what we call data stewards, the people that are embedded in the company and have a bit of knowledge of SQL, but maybe just a bit. And what we try to bring those people is, like, a super efficient search engine so that they can find what they're looking for. This is not something you would typically find in a Monte Carlo approach.
What we also want to show is, for instance, the popularity, how frequently used a table or an asset is, and this is interesting. Also, the take on lineage, for instance. The way we show it is more to be able to see at a quick glance how it's constructed, rather than just explore and be able to debug if something goes wrong with a huge map of every asset that exists. This would be overwhelming, and we try to be not obtrusive, to keep it simpler. For instance, we have a set of automations, such as the column propagator, as we call it. Maybe we're not the best at naming things. But what I really love about that feature is that in some instances, we've seen admins or data leads, the people in charge of modeling, or analysts, be able to document a lot of columns really fast, because the modeling was nicely designed, so a lot of columns look the same or should have the same documentation. And we are covering this. It's geared towards the people documenting, so that we can leverage the time spent on it. So those are the kinds of tools: the UI, the features of automation, and the take on how to present the information so that it's easy for data analysts, and not as comprehensive as it would need to be for a data engineering tool. That doesn't mean we're hiding things, but we try to put foremost what is most impactful and interesting for a data consumer.
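As an illustration of the column propagator idea just described — fanning one vetted description out to same-named, still-undocumented columns — here is a minimal sketch; the metadata shape is invented for the example, not Castor's API.

```python
def propagate_description(columns, source_column, description):
    """Copy a vetted description to every column with the same name
    that is not yet documented. `columns` is a list of dicts with
    'table', 'name', and 'description' keys (illustrative schema)."""
    updated = []
    for col in columns:
        if col["name"] == source_column and not col["description"]:
            col["description"] = description
            updated.append(f'{col["table"]}.{col["name"]}')
    return updated  # let the admin review what was touched

cols = [
    {"table": "orders", "name": "user_id", "description": ""},
    {"table": "sessions", "name": "user_id", "description": ""},
    {"table": "users", "name": "user_id", "description": "Primary key of users"},
]
print(propagate_description(cols, "user_id", "Foreign key to users.id"))
# -> ['orders.user_id', 'sessions.user_id']  (the documented column is untouched)
```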
[00:12:18] Unknown:
And so in terms of the opportunities for innovation in the space, as you mentioned, there's sort of the core element of, this is the source of truth for the data that we're all working from, whether that comes from the dbt pipeline metadata or the data warehouse directly or some of the pipeline jobs. But there's definitely a lot of differences in terms of how you want to approach it and how you want to expose it, as we dug into as far as the data consumer versus data producer focus. And so what are the areas of innovation that you see and that you are working towards with Castor to try and help sort of drive the state of the art forward as far as how we're interacting with data, how we're thinking about how to organize it, how to collect it, and how to add useful and rich context to it?
[00:13:00] Unknown:
So maybe I can just start off with what's the baseline. The baseline is about fetching those assets, as we mentioned, and computing lineage and a bit of information, to make sure that we leverage every bit of time spent by the users, by the data analysts. Then, digging into your question about the opportunities for innovation, I would say that a lot of things were not possible a few years ago, even, like, 5 years ago, that's for sure. The exponential growth of Snowflake that we've been seeing is, for me, the tipping point of an overall transition to cloud warehouses.
This allows a transition to a software-as-a-service approach for cataloging, such as ours. Whereas before, a lot of companies were really restricting the access to the warehouse, or they had on-premise warehouses, which could not allow such tools. So it was pretty much about offering open source software that you could install on your machines or in your private, and sometimes public, cloud, but it was pretty restricted. For me, this was the entry point: the fact that now a lot of companies are totally fine with going with a cloud warehouse, be it Redshift, Snowflake, or BigQuery.
Then another thing for me, which is also driving innovation like crazy, is dbt, as you mentioned. I will not dig into what it does, because you presented it really well. Why is it, for me, the central piece of the recent evolution? Because it gives flexibility and autonomy to data or BI analysts. It may seem really like a simple thing to say, but it takes a lot of energy and a lot of focus off the data engineering teams and makes them able to tackle governance and documentation issues such as the ones Castor is trying to tackle. So for me, those were the recent enablers. We also are enjoying the fact that Amundsen pioneered, among a lot of other tooling that was developed in house and sometimes open sourced. They paved the way for a broader understanding of what a data catalog is. Pretty much the same way Airflow kind of led the way for what a data flow tool is. So this is the base. So where we're going from this is pretty interesting. We're trying to leverage those the same way the ELT tools leveraged them maybe 5 years ago, I'd say, when we saw the incredible growth of Fivetran or Stitch.
The tools we're building, such as, as I said, catalog or quality, will have the standpoint of connecting to those warehouses and sources, and we will build a lot of interesting features out of what we can infer, what we can compute, what additional information we can derive. I'll give you an example. We have, like, our own algorithm for computing popularity, but we can also go further when you have an editor: because we have all this information, we can power a much more powerful editor. I think Atlan has been trying to ship this kind of thing recently. But still, for me, there's a sort of a Chinese wall between metadata and data.
Whereas, if you allow deep read access to the data itself, which for the moment we don't ask for, because we wanna stay in the space where customers are really confident about the level of security we're providing, then you enter another layer of a lot of things you can assist your customers with. So I think a lot of people are gonna bring up interesting insights. I'd be really not surprised to see the same thing we saw with GitHub providing, you know, this bot that kind of autocompletes your code, that can type code for you. It could be the same for SQL, like pure knowledge. If you can see that some people always do the same kind of join and always the same kind of pattern, you can pretty much complete a lot of what they do.
Metabase also, which I didn't mention so far, has been a huge, incredible breakthrough in terms of self-service data. And this direction also can go much further, maybe in open source, but also with more proprietary software. So I'm pretty much speculating about the future there, but I think it's definitely super interesting what's happening.
[00:17:32] Unknown:
Digging more into Castor itself, can you talk through a bit of the overall feature capabilities that you're building into it and some of the technical architecture that is powering the platform that you're creating?
[00:17:42] Unknown:
As I said, we're targeting the data users: analysts, BI engineers, but also people who have access to data, depending on the company, which could be product managers or marketing people. So, in short, anyone who consumes and works with data can benefit from it. So that's the level of what we offer. And as I said, you can easily find your assets. You can easily add business knowledge. We will automatically improve and recompute any assets you might have. And we're bringing automation, as I said. We'll have a few anecdotes on it later, but some users can really have an epiphany in the fact that they can spend 5 minutes and document a lot of assets. And this was not that easy if you had to commit a lot of code in dbt docs, for instance, which is a great tool, but it's more of a source of truth than it is a collaboration tool. So for me, we're bringing collaboration to this space and also making sure we track all sources of documentation and reconcile them.
So that's about what we offer. In terms of the targets, I said it, and we might open up also a bit to the people doing the data modeling. Sometimes it's data engineers, sometimes it's another persona, so that's about it. We're really happy, and we have data engineers on the platform, but it's not meant to be a monitoring tool for them. So in terms of technical architecture and implementation, there are 2 big parts. The first part would be the extraction side. It pretty much resembles what an ETL does. It pulls or pushes assets from the customer sources, say the warehouse or the visualization tool. I say push or pull because sometimes we have logic that is given to the customer for him or her to run, and then he or she pushes. And sometimes we pull, depending on the level of access and the situations we're running in. So we are trying to develop a solution that makes our customers confident about the security while we still give them maximum value.
The second part is the web app. It pretty much resembles a lot of software-as-a-service products, serving the consumer with a great UI and a back end that is tailored for this app. This is also why we wanted to separate both: one is more data engineering, algorithm, and data flow oriented, and the other one resembles a lot of SaaS products in its architecture. I'm not saying it's off-the-shelf software, but it's pretty much more conventional. That's about the architecture. I'll be happy to dig into any more details you wanna ask about that.
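To picture the push flavor of the extraction side, here is a sketch of what customer-run logic might send to a catalog's ingestion endpoint. The URL, payload shape, and auth scheme are all hypothetical, invented purely for illustration.

```python
import json
import urllib.request

def push_metadata(api_url: str, token: str, assets: list) -> int:
    """Push extracted metadata (no row-level data) to a hypothetical
    catalog ingestion API, returning the HTTP status code."""
    payload = json.dumps({"assets": assets}).encode("utf-8")
    req = urllib.request.Request(
        api_url,
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Customer-side run: extract from the warehouse, then push only metadata.
assets = [{"kind": "table",
           "name": "analytics.public.users",
           "columns": ["id", "country", "created_at"]}]
# push_metadata("https://example.com/api/v1/ingest", "SECRET_TOKEN", assets)
```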
[00:20:36] Unknown:
And so in terms of the actual metadata extraction and doing the data discovery elements of it, I'm wondering if you can just talk through a bit more of how you handle sort of identifying what are the useful assets to pull out, doing any sort of entity resolution for being able to deduplicate records or data sources, and then some of the complexities that come in as far as doing the lineage calculation from the view that you have of the user's infrastructure.
[00:21:04] Unknown:
I remember the first time I was facing someone offering me, like, lineage computation out of the query logs. I was like, oh, how did you do this? And I started asking a lot of questions. And then I got the chance to do it myself, and it's not that easy. From the start, we acknowledged that it was not an exact science. The only exact science there could be is having the dbt file, for instance, because this is predictable. This is the source of truth. What we're trying to do is inferring. So we're trying to have a target of, say, like, I don't know, 70 to 80 percent accuracy. I'm just saying this as an indication, I'm not committing myself, just saying this is the kind of figure we're trying to look at. We have a base of tests of our own queries and also beta partners that allowed us to dig deeper into the queries to improve the product.
So how do we do this? The problem is that if you parse every query with the full grammar of SQL, it can be really slow, because some customers might have 100,000 queries a day, 200,000, even more. So it's really intense. And so we have, like, a cascade. What we have is first running through a faster layer that tries to infer what we're looking at, whether it's useful, and then a second layer that, within this and still in a fast way, extracts the source tables. So in a select, it would be the from and the join. And also, if there are CTEs or subqueries, it needs to be smart and figure those out. What can be tricky is that sometimes you've got aliases that are only temporary to the dbt run, or whatever other tool you use for your transformations.
And that can be tricky. So you also need to take into account that those could perfectly well only live for a few minutes and then not appear in the catalog. Sometimes the catalog has a lag also, so you would think you have the table, but it's not showing yet. So a lot of small things happening. But, overall, the real challenge is about the speed of your processing. Another nice quick win would be to try to deduplicate, as you said, those queries. So this is about the catalog. There are other types of issues, such as, for instance, something called sharded tables, where you have a root such as, I don't know, users, and then you have a suffix, which would be the date. And internally, some warehouses are able to do the stitching so that when you query accounts or users, it seems like one table, but it's like partitions in Postgres.
Behind the scenes, there are a lot of tables, but it doesn't show like this in the queries or in the catalog. So sometimes some power user features that are really useful in the warehouses turn out to be a lot of trouble for us in this field of inferring information. So we are improving, and we're trying to be as efficient across all of the warehouse technologies that we're covering right now. But, frankly, it's still ongoing. There's a fat tail of improvements which we will, I think, never completely master, but we're trying to go after the biggest chunk every time, so that we can address our largest consumer base and also keep our current users pleased and delighted with what we do.
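As a sketch of the two-layer cascade described here — a cheap prefilter, then real parsing to pull out source tables while discarding CTE aliases — this is roughly what it could look like using the open source sqlglot parser. It illustrates the technique, not Castor's actual implementation, and a real system would also need the alias and sharded-table handling mentioned above.

```python
import re
import sqlglot
from sqlglot import exp

# Layer 1: a fast regex filter keeps only queries that build assets.
WRITE_PATTERN = re.compile(r"^\s*(create|insert|merge)\b", re.IGNORECASE)

def source_tables(query: str, dialect: str = "snowflake") -> set:
    """Layer 2: a full parse extracts table references, minus CTE aliases."""
    if not WRITE_PATTERN.match(query):
        return set()
    tree = sqlglot.parse_one(query, read=dialect)
    cte_aliases = {cte.alias for cte in tree.find_all(exp.CTE)}
    return {
        table.name
        for table in tree.find_all(exp.Table)
        if table.name not in cte_aliases
    }

q = """
CREATE TABLE mart.daily_revenue AS
WITH recent AS (SELECT * FROM raw.orders)
SELECT r.day, u.country FROM recent r JOIN raw.users u ON r.user_id = u.id
"""
print(source_tables(q))
# {'daily_revenue', 'orders', 'users'} — the target plus its real sources,
# with the temporary CTE alias 'recent' correctly filtered out.
```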
[00:24:31] Unknown:
And then in terms of the actual architectural aspects of how you're thinking about the design of the system to make it extensible and being able to support the customer infrastructure, I'm wondering, first, what are the data infrastructure components that you're targeting out of the box, as far as whether a customer needs to have a data warehouse set up in order to be able to really gain a lot of value from Castor? And what are some of the extension points for being able to plug in other sources of data or other metadata information to be able to enrich the context that Castor is able to provide as people are doing discovery and management of the different assets within the organization?
[00:25:15] Unknown:
As you said, if you're a company that doesn't yet have a warehouse, I don't think Castor is in its best spot, because you won't leverage a lot of what is possible. For instance, the query logs don't exist in traditional relational databases such as MySQL or Postgres. You have some form of logs, but you don't have the extensive or comprehensive logs you would have in warehouses, which are designed for retaining those. So lineage will not be easily computed. But you can tell me you have foreign keys and so on and so forth. So if you're doing analytics in a Postgres database, which I did at some point, so I'm not gonna shame anyone here.
But it's not for the best, because I think you have bigger fish to fry, because you have a lot of other things in your data journey. But then again, we have a few customers with it, and we're serving them. It just means that it's only focusing on the catalog, and we're so much more. So that's about it. And then in terms of other sources we could connect to, as you mentioned: for us, we want to connect to everything that is an easily reached artifact. The dbt manifest is one, but also connecting to your Tableau, Looker, or Metabase instance, which covers more than 80% of the market with those, I mean, with the people on cloud warehouses. I'm not saying the full market, because this is kind of biased, but I explained earlier why we were in this field. So connecting to those, I think, is pretty efficient, and we're trying to push this through, and it's kind of easy once the security concerns are out of the way. But we've seen a pattern of a lot of companies more willing to push rather than to have the data pulled. So we need to develop tooling.
Be it libraries, be it open source code, in the same way as Airbyte in the ELT space, we will have, at some point, to provide libraries that are able to push, and maybe a bigger API. Those are the kinds of spaces we need to dig into if we want to make this integration smoother. For the moment, the one which is smooth is the warehouse. But for the other tools, we have smart approaches to go quick, such as an API with a more generic visualization interface, but we wanna be more specific than generic, because in Tableau, there are interesting assets such as explores, which you don't have in Metabase. But in Metabase, you have a source code which can be embedded and a different kind of queries. And in Looker, you've got LookML, which is another layer of interesting sources of information. So working our way through the specifics of the visualization tools is also kind of tricky, but we are trying to go through the maze.
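For the pull side against a visualization tool, here is a sketch of fetching dashboard metadata from a Metabase instance over its REST API. Metabase does expose a session endpoint and a dashboard listing, but treat the exact fields shown here as assumptions to verify against the version you run.

```python
import requests

def fetch_metabase_dashboards(base_url: str, username: str, password: str) -> list:
    """Authenticate against a Metabase instance and list its dashboards,
    the starting point for cataloging BI assets."""
    session = requests.post(
        f"{base_url}/api/session",
        json={"username": username, "password": password},
        timeout=30,
    )
    session.raise_for_status()
    token = session.json()["id"]

    resp = requests.get(
        f"{base_url}/api/dashboard",
        headers={"X-Metabase-Session": token},
        timeout=30,
    )
    resp.raise_for_status()
    # Keep only the fields a catalog would index: id, name, description.
    return [{"id": d["id"], "name": d["name"], "description": d.get("description")}
            for d in resp.json()]

# dashboards = fetch_metabase_dashboards("https://metabase.example.com",
#                                        "svc@example.com", "SECRET")
```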
[00:28:12] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. And I think it's worth talking a bit more about the treatment of visualizations and dashboards as assets within the data ecosystem of a company. Because most of the time, when you're thinking about data discovery and data cataloging, people automatically assume the tables and records in the data warehouse, or the files that are in, you know, Parquet on S3 for their data lake. And so I'm wondering if you can talk through some of the benefits of using the visualizations as a category of asset to track and some of the ways that that's used within Castor for being able to collaborate across the organization as a data consumer?
[00:29:27] Unknown:
Well, some data consumers can only access the visualization layer. They don't even have access to the warehouse layer. So, I mean, this is a first statement that can already kind of pin down the fact that if you only limit yourself to the warehouse, depending on the architecture and the organization of the company, you may very well only reach 5 or 10 people, which is useful but doesn't bring that collaboration and knowledge base that we want to bring. So, typically, what I have in mind, I can speak about Qonto, for instance, remembering how it worked. You would have, like, a Tableau instance, which is more for reporting for the stakeholders, and you would have a Metabase instance, which is more for self-service.
Both serve a different purpose. Metabase is less organized, with a lot of queries, and this is also a different and very interesting topic about managing the kind of explosion of wild west things that can happen in your self-service option. But yet, there are a couple, or even, like, dozens, or, I don't know, 50 dashboards which are highly valued, and those are kind of supported by the data team. And because they're supported, we really want to have them in our product, because we want to track them. We want to make sure they're surfaced as highly popular so that people know. Oh, okay, I wanna know how many users there are in our product, how many consumers in our B2C product. So I will type users, and then the first thing that will come up is maybe a table called users and a few others, less popular, but definitely those dashboards, those visualizations. They are super useful, because this is usually where business and data converge. It's mainly on this definition of the queries behind and the way to present data. So I would say the data modeling and the queries that are the base, in dbt, of how your transformations happen are super impactful. But in the end, the users in the business, be it finance, be it marketing, be it product or operations, they're operating at the visualization layer. So for us, because we want to reach out to these data stewards, these data practitioners in those areas, we wanna make sure we have these visualizations. But, yeah, I think this is why we're interested. It's not so much, you know, a go-to-market thing of, oh, no one is in this spot, or not a lot of people, we should go there. It's more that we wanna make sure all the people that are touching data and consuming data, even indirectly through the BI tools, can access it and the knowledge behind it. And what can happen is that once they stumble upon this Tableau dashboard, they can poke the data analyst or even the data engineer saying, oh, this is the table behind it, because we have lineage.
I'd be interested to learn a bit more. Maybe we can add a new column, because we always need this, and I think it'd be valuable to have it in the source of truth. Maybe I'm pushing a bit far in a direction that might never exist, but definitely they're able to, you know, just open the door and see what's inside the house and not just be shut out of it. And that's also interesting.
[00:32:51] Unknown:
Yeah. I definitely think that treating the visualization as a first class concern within the platform is useful, because, as you pointed out, the consumer of the data isn't necessarily going to be able to, or even want to, understand the table level metadata. They just see, this is a report that shows me what I need to know, or, you know, I need to know, do we have a report that answers this? And if not, then I need to ask for it. And then on the data analyst and data engineer side, if you do have a business user coming in saying, I need to be able to answer this question, then you can say, especially if it's a large organization, well, we already have this report over here. Is this what you want? If not, here's how we can modify it, kind of a thing. And so I think that's definitely a great way to sort of open up that collaboration beyond just the data professionals to everybody within the business.
[00:33:41] Unknown:
Totally agree. And, also, I'll jump back on what you just said about duplication. It could also help in seeing if anyone has already answered, or tried to answer, this, because sometimes, in Metabase, for instance, the search tool is crap. I mean, I love their product, but this is definitely a crappy part of it. And sometimes, also, people did things in their private space, and they could share it. And it could also be an opportunity, because you have access to what they did on their side, which is not public, to say, okay, maybe this one is also valuable. So less reinventing the wheel. I mean, there will always be a bit of this, because the queries are never exactly the same. But you can limit it, with exploration, with people saying, this is good enough. I'm not gonna ask the data team. I can work my way with this.
It's also a big plus for, you know, the overwhelmed data teams with a lot of ticketing. I'm sure you've stumbled upon this kind of organization, and we wanna support them too. Absolutely.
[00:34:44] Unknown:
And then, talking through the overall process of getting Castor set up within an organization and onboarding people to start leveraging it to do this discovery and collaboration around their data assets, can you just talk through some of the steps that are involved and some of the user education that you've had to do to help people understand the value and benefit and how to get the most capabilities out of the product?
[00:35:09] Unknown:
Sure. I mean, there's a lot of explanation and talking through even before we get to the real onboarding process, because of the sales cycles, which are super slow compared to, I mean, more off-the-shelf software such as, I don't know, Sentry or GitHub. Because the onboarding process is more complicated, we need to educate more and make sure that we are aligned with the customer. But that happens more in the sales part. Then, in terms of the actual process of adding a warehouse, which is the meaty part and the starting point for all of it: the customer opens an account, and then the admin does this. The admin will walk through a quick warehouse setup, and we have nice guides for them.
It gives us access, and we happily explain to them why we need each role that we ask for, whatever warehouse we're talking about. And then they add the warehouse themselves in the product. And if they have a visualization tool, they will also do it. And then they start using it. Everything else is taken care of by us. What I can also stress, more in terms of life cycle and added value, is that we have someone doing product ops. It's more about helping them bridge the gap of: I also want this, but it's not really a feature that we're gonna design now, so how can we squeeze it into some sort of existing feature so that you can have it? For instance, I have these Google Docs with documentation.
Could you load them for us? So we have this kind of support on the side that helps them make it easier and better with semi-automated actions. We totally understand that a lot of situations are different, that maybe they had initiatives or documentation before, and they wanna make sure that this also makes it into Castor. And, also, we wanna make sure that if we don't cover all of their use cases, and I don't think we can, but we're trying to, we can sometimes find a way to squeeze in more, without a lot of specific development, but with smart manual actions.
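On the "why we need each role" point, here is a sketch of the kind of scoped, metadata-oriented access a setup guide might walk a warehouse admin through, assuming Snowflake. The role and object names are hypothetical, and the exact privileges any given tool needs will differ.

```python
# Hypothetical read-only setup a Snowflake admin might run for a catalog
# crawler: visibility on metadata and query history, not on row data.
SETUP_STATEMENTS = [
    "CREATE ROLE IF NOT EXISTS catalog_reader;",
    "GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE catalog_reader;",
    "GRANT USAGE ON DATABASE analytics TO ROLE catalog_reader;",
    "GRANT USAGE ON ALL SCHEMAS IN DATABASE analytics TO ROLE catalog_reader;",
    # REFERENCES exposes table and column definitions without granting SELECT:
    "GRANT REFERENCES ON ALL TABLES IN DATABASE analytics TO ROLE catalog_reader;",
    # Query history (for lineage and popularity) lives in the shared database:
    "GRANT IMPORTED PRIVILEGES ON DATABASE snowflake TO ROLE catalog_reader;",
]

def run_setup(cursor) -> None:
    """Execute the grants with any DB-API cursor (e.g. snowflake-connector)."""
    for stmt in SETUP_STATEMENTS:
        cursor.execute(stmt)
```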
[00:37:22] Unknown:
In terms of the product design, I'm wondering what have been some of the complexities that you've had to tackle to make it accessible and understandable to all the different roles that have to interact with it in different contexts, and just being able to bridge those differences in terms of experience and perspective and create a cohesive experience for everybody in the organization.
[00:37:46] Unknown:
It's definitely something hard. I can remember, from a demo from Atlan, for instance, that I was amazed by a lot of things in the product, but, frankly, I had issues finding my way, even on simple screenshots, of what this was about. So this is why, building on what we've seen before, we try to really have a smooth and intuitive UI. And we have feedback on it, which is really positive too. So maybe I can dig into the 2 personas. There's the admin and there's the regular user. The admin can oversee who is the owner of an asset. He or she can also validate, and can operate the description propagation I mentioned earlier: okay, you've got opportunities to propagate this description to other columns of the same name, and this admin can do it.
We wanted to make sure that only an admin could do this, because it's super impactful, super useful, but should only be given to people who are capable of having this responsibility. But what is available for all users is the fast search, because this is what we're trying to bring them: to find their way. So we want to allow them to find their way, so everything is easily searchable. The lineage, for instance: we have a way to allow logical exploration. You're in an asset, you can see its parents, its children, and work your way around. We didn't want to have, like, a Neo4j approach with everything shouted back at you. We wanted something more easily operable.
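A sketch of that logical exploration idea: given a lineage edge list, surface only an asset's direct parents and children instead of rendering the whole graph. The edge representation is illustrative.

```python
def neighbors(edges, asset_id):
    """One hop of lineage: the direct parents and children of an asset.
    `edges` is an iterable of (parent, child) pairs."""
    parents = sorted({p for p, c in edges if c == asset_id})
    children = sorted({c for p, c in edges if p == asset_id})
    return {"parents": parents, "children": children}

edges = [("raw.orders", "mart.daily_revenue"),
         ("raw.users", "mart.daily_revenue"),
         ("mart.daily_revenue", "dash.revenue")]
print(neighbors(edges, "mart.daily_revenue"))
# {'parents': ['raw.orders', 'raw.users'], 'children': ['dash.revenue']}
```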
Another thing I can mention which really helps is the popularity. I've mentioned it several times. I think it's a fundamental element, because it helps you immediately differentiate between something which is poorly used, or frankly not used at all, and something which is really a core table. I mean, it's pretty much the same as going on IMDb and checking whether a movie is popular or not. It's a fast way to do it. And I think this has proven useful. We've had feedback from people saying, oh, but this table is useful, and I don't see it in the top popular.
So we tweaked the algorithms, and we try to improve them so that they're meaningful to people. And, frankly, we have good feedback on it. So that's one thing. Maybe one last thing I can mention is the way we present information. For instance, if you go on a table asset page, we try to give you what's most useful first: the columns with their descriptions and names. And then we have more of a side panel with additional information you can hide or show. Because there are so many things we can infer, so much information that is available, but we don't wanna make it look like, you know, a Christmas shop with a lot of things glittering and just shouting too loud. This is also a take. We might be wrong, but we take the stance of trying to bring forward what we believe data users value the most.
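Castor's actual popularity algorithm isn't public, so purely as a sketch of the general idea: a recency-weighted score over query-log entries, where fresh queries and a broad set of distinct users push a table up the ranking while stale-but-once-popular tables sink.

```python
import math
from datetime import datetime, timedelta

def popularity(query_log, now=None, half_life_days=30):
    """Score tables from query-log entries of the form
    (table, user, queried_at), decaying each query's weight with age."""
    now = now or datetime.utcnow()
    scores, users = {}, {}
    for table, user, queried_at in query_log:
        age_days = (now - queried_at).days
        weight = math.exp(-math.log(2) * age_days / half_life_days)
        scores[table] = scores.get(table, 0.0) + weight
        users.setdefault(table, set()).add(user)
    # Breadth of usage matters too: scale by the distinct-user count.
    return sorted(((t, s * len(users[t])) for t, s in scores.items()),
                  key=lambda pair: -pair[1])

log = [("users", "ana", datetime.utcnow() - timedelta(days=1)),
       ("users", "bob", datetime.utcnow() - timedelta(days=2)),
       ("tmp_scratch", "ana", datetime.utcnow() - timedelta(days=200))]
print(popularity(log))  # 'users' ranks far above 'tmp_scratch'
```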
[00:40:52] Unknown:
And then in terms of the usage of Castor and the customers that you've worked with, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:41:03] Unknown:
I wanted to say, it works, and that's unexpected. But apart from that little pun, I think that we've seen interesting things, such as, you know, power documenters. Yes, it's a thing. They turn it into a game, where they challenge themselves whenever someone in the company tells them, oh, this documentation could be improved. And they tease each other saying, okay, mine are pristine; yours, there's some internal chat about it. So apart from the bad talking, the little teases, I think the gamification of it wasn't intended. The fact is that, in the end, documenting is not thrilling. I mean, it's not the sexiest topic of all. But on the other side, consuming documentation is really useful. I could mention, for me, the best example of it: Stripe, as one of the best developer documentations in existence.
I can't imagine writing it would be thrilling for the people at that company, but they know that it's super, super useful in the end. And I think the same thing happens for these power users. They're really proud of the way people use it. Another iteration of funny things would be, we've seen a user, or several of them, documenting 500 columns in less than 10 minutes, thanks to this propagation tool. So in the end, they reached, I don't know, 10 or 20 percent documentation coverage across their whole warehouse in less than 10 minutes. Maybe because they had done it right in the first place, but also because the tool allowed it.
So that was not expected. We would have thought maybe it was gonna get them 10 or 20 columns, and 500 was definitely a high figure.
[00:42:40] Unknown:
And then in terms of your experience of working with Castor and helping to build out the product and using it to sort of do whatever testing you need to do, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:42:55] Unknown:
I think, to keep our feet on the ground, we really learned about the wide span of ETL, dbt, and warehouse usage across our customers. The same way I stated earlier that someone with the role of data engineer or data analyst could do a lot of different things depending on the organization and the needs of his or her company, the same happens in the way our customers set up and work their data flows. So, I mean, there are good practices, we've seen them, but there's no real standard. And what is challenging for us is making sure that all our users can leverage Castor the same way. So to make sure that, whatever the way they design their architecture, we can still take the information out, align it with the way we work, and feed it back into the product so that it's useful. And as I said, this in terms of discovery and in terms of collaboration.
So clearly, that's for me the biggest lesson: there are a lot of different tools and a lot of ways to set them up. For instance, maybe to give an example, you can have dbt in dbt Cloud or in your own infrastructure. And in your infrastructure, you can perfectly set it as one task in your Airflow or try to split it per model. You can perfectly have a lot of different staging environments, or not. You can activate the tests, or not. I mean, I'm not gonna spend the next 10 minutes trying to pinpoint every little detail about it, but I think you can see where I'm going: even if it's a widely used tool with a lot of common base, every company can tweak it in a different way. And it's a bit challenging, because some of the tweaks can kind of impair our ability to see through the catalog, the links, and so on.
[00:44:50] Unknown:
For people who are interested in improving the collaboration and discovery of data within their organization, what are the cases where Castor might be the wrong choice?
[00:44:56] Unknown:
As I said earlier, if you're really early in your data journey. For instance, you don't yet have an ELT, you still have ETL, or, I did this at one point a few years ago, but I have confessed my sins since: what I had done was transforming from one Postgres table in memory to another one. Well, okay, not the best option. It worked at some point, until it didn't. And this is when I chose to move to a warehouse and an ELT approach. Those are natural choices, for me, before you even try to dig into a discovery or catalog tool. Yet I don't think it's a no-go. I'm just saying that you will need to evolve to this warehouse at one point, and you will need to have this ELT in place to support the mental complexity of understanding what you have in hand.
We didn't have that early on at Qonto. We did several improvements to allow it, and I think that it's gonna be a burden, or at least it's gonna take some time that, for me, you would surely be better off devoting to benchmarking and setting up those tools. Yet, if you already have a lot of, for instance, self-service data, even if you're, like, a 50 person company, or 25, and almost everyone in the company does SQL, it could already be interesting to have Castor because of this. But in most cases, if you don't yet have what I said, the ELT and the warehouse, I think it's not gonna be the best choice, or it's gonna be too soon, maybe.
[00:46:29] Unknown:
In terms of the ongoing work that you are doing with Castor, what are some of the near to medium term plans that you have for it and some of the areas for exploration that you're particularly excited about?
[00:46:40] Unknown:
It might not shine like a glowing sun, but we're trying to bring more integrations to satisfy more customers. This might seem like something really basic, but what we have in the current plan is bringing more integrations and also more depth. I mentioned Tableau. With Tableau, you can do it easy with only the dashboards. You can add another layer with what we call tiles, which would be, like, the questions behind the dashboard, and you can even go deeper with what is called explores, mentioned earlier. So I think depth, along with integrations, definitely has a large footprint in the roadmap. That's important for us. And we see the same thing happening with Airbyte, for instance, in the sense that customers are expecting this. For instance, if they shift to a new solution, they wanna have it in Castor, because they were already customers, and they will not understand why it is not there. So that's one part. Another really interesting part, which I love because I really love algorithms, would be more automation features, like I said, and also maybe a bit more quality oriented. But then again, we're not Monte Carlo; it's more in the direction of empowering data users than data engineers, to make sure that they have the level of information that they want. For instance, we were going in the direction of what is called tracking tables, to make sure that for the tables you're most interested in, or the tables that are the most popular, whatever the reason is, we inform you of their quality and whether they are up to date or not. And this is the kind of automation we have in mind.
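The tracking-tables idea, sketched in code: compare each watched table's last load time against an expected cadence and flag the stale ones so their consumers can be notified. The table names and thresholds are illustrative.

```python
from datetime import datetime, timedelta

# Hypothetical watchlist: table -> maximum acceptable staleness.
TRACKED = {
    "mart.daily_revenue": timedelta(hours=26),
    "mart.signups": timedelta(days=2),
}

def stale_tables(last_updated, now=None):
    """Return the tracked tables whose last update exceeds the allowed lag.
    `last_updated` maps table -> datetime of the most recent load."""
    now = now or datetime.utcnow()
    return [t for t, max_lag in TRACKED.items()
            if now - last_updated.get(t, datetime.min) > max_lag]

status = {"mart.daily_revenue": datetime.utcnow() - timedelta(days=3),
          "mart.signups": datetime.utcnow() - timedelta(hours=5)}
print(stale_tables(status))  # ['mart.daily_revenue'] -> notify its consumers
```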
[00:48:13] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:36] Unknown:
As a European, and, frankly, the same thing is happening at a lower scale with the California law, I think we're lacking easy to use data governance tooling to address GDPR. I mean, GDPR is, like, the biggest nightmare slash joke. I don't wanna be, like, rude to anyone saying this among heads of data, because they know they have to do it. But even if they wanna, you know, play by the book and also be respectful to their consumers, it's not adding a lot of value, but it demands a lot of time, and there's no playbook. For instance, even if you wanna do it the best way you can think of, the way you interpret the law, or even the pieces of information, like I said, that a consultant might have given you.
You're still, like, a bit in the dark. You're not sure. And in the end, if you get, like, the administration that pays you a visit and audits it, and that happened to Alan last year, and they have a nice Medium article on it, then even if you thought you had done it right, they're still gonna ask you to improve a lot of things. They're gonna take a lot of your time to make sure you're compliant with the way they perceive things. So in the end, trying to have this, like, tool, or a help tool, a help-and-consulting thing, in terms of GDPR, that's definitely an area which is not very sexy, but would bring a lot of value and a lot of peace of mind to a lot of people around the globe, specifically in Europe.
So that's one thing I think is a big gap. Also, self-service data is super powerful, but we've seen that it's really challenging in terms of governance, because there are copycats of queries, there are redefinitions of KPI terms, BI definitions such as ARR. For instance, how much money am I making? What's the revenue? If you're not paying enough attention, with self-service data, 4 different teams can reinvent their own definitions of this business term, and that's super dangerous. Self-service allows a lot of people to monitor the activity, to make sure they have answers to their own issues. So it's a difficult trade-off, and definitely the area of deciding who can access this table, because this one is, you know, consolidated or validated, and which area is more, not rubbish, but less controlled.
This is something that Metabase has been trying to do, but, frankly, the matrix of rights is a nightmare. I've sat through kind of an experiment to improve it, and it's difficult. So that's another layer, which is about
[00:51:16] Unknown:
enabling self-service data without having too much of a mess. So, yeah, GDPR and self-service data are definitely super opportunities in the near future for me. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Castor. It's definitely a very interesting product and an interesting problem area, so I'm definitely excited to see a lot of the activity that's happening there. I appreciate all the time and energy you're putting into trying to help provide solutions to that space. So thank you again for your time, and I hope you enjoy the rest of your day. Thanks a lot, Tobias.
[00:51:49] Unknown:
Thanks also for all the energy you put into running this podcast, and congrats on your book.
[00:52:00] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Amaury Dumoulin
Journey into Data Engineering
Building Castor: Motivation and Vision
Data Cataloging and Discovery
Data Consumer vs. Data Producer Focus
Opportunities for Innovation in Data Management
Castor's Features and Technical Architecture
Metadata Extraction and Lineage Calculation
Visualization as Data Assets
Onboarding and User Education
Product Design and User Experience
Unexpected Uses and Lessons Learned
When Castor Might Not Be the Right Choice
Future Plans and Areas of Exploration
Biggest Gaps in Data Management Tooling
Closing Remarks