Summary
Building a well-managed data ecosystem for your organization requires a holistic view of all of the producers, consumers, and processors of information. The team at Metaphor is building a fully connected metadata layer to provide both technical and social intelligence about your data. In this episode Pardhu Gunnam and Mars Lan explain how they have designed the architecture and user experience to allow everyone to collaborate on the data lifecycle and provide opportunities for automation and extensible workflows.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about Metaphor Data, a platform aiming to be the system of record for your data ecosystem
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Metaphor is and the story behind it?
- On your site it states that you are aiming to be the "system of record" for your data platform. Can you unpack that statement and its implications?
- What are the shortcomings in the "data catalog" approach to metadata collection and presentation?
- Who are the target end users of Metaphor and what are the pain points for each persona that you are prioritizing?
- How has that focus informed your priorities for user experience design and feature development?
- Can you describe how the Metaphor platform is architected?
- What are the lessons that you learned from your work at DataHub that have informed your work on Metaphor?
- There has been a huge amount of focus on the "modern data stack" with an assumption that there is a cloud data warehouse as the central component that all data flows through. How does Metaphor’s design allow for usage in platforms that aren’t dominated by a cloud data warehouse?
- What are some examples of information that you can extract through integrations with an organization’s communication platforms?
- Can you talk through a few example workflows where that information is used to inform the actions taken by a team member?
- What is your philosophy around data modeling or schema standardization for metadata records?
- What are some of the challenges that teams face in stitching together a meaningful set of relations across metadata records in Metaphor?
- What are some of the features or potential use cases for Metaphor that are overlooked or misunderstood as you work with your customers?
- What are the most interesting, innovative, or unexpected ways that you have seen Metaphor used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metaphor?
- When is Metaphor the wrong choice?
- What do you have planned for the future of Metaphor?
Contact Info
- Pardhu
- @PardhuGunnam on Twitter
- Mars
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Metaphor
- DataHub
- Transform
- Supergrain
- MetriQL
- dbt
- OpenMetadata
- Pegasus Data Language
- Modern Data Experience
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today, that's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey, and today I'm interviewing Pardhu Gunnam and Mars Lan about Metaphor Data, a platform aiming to be the system of record for your data ecosystem. So, Pardhu, can you start by introducing yourself? Hi, Tobias. Nice to be back on your podcast.
[00:01:45] Unknown:
I am Pardhu. I am the COO and cofounder of Metaphor Data. I've been in data platforms for the last decade. I started with AWS resource optimization for EC2 and data visualization, then eventually worked on metrics, or what we called the unified metrics platform, at LinkedIn. From there, my interest moved towards metadata, which is a very related space, and I joined the metadata team at LinkedIn, where I was the engineering manager leading and growing the team along with Mars for two years. And, yep, now I'm extending into the space, working on metadata for what feels like a long time now.
[00:02:23] Unknown:
And, Mars, how about yourself?
[00:02:25] Unknown:
Hi, I'm Mars. I'm the CTO and cofounder of Metaphor. Like Pardhu mentioned, we go back to LinkedIn. I actually have a slightly different background. I didn't start with a traditional data background. I was at Google working on Google Cloud, the early days of Google Cloud, so more infra types of roles, and then I was recruited into LinkedIn for the metadata team specifically. I spent about three, almost four years over there solving various interesting metadata related problems
[00:02:52] Unknown:
before we came out and started Metaphor. Coming back to you, Pardhu, you mentioned that you also spent some time on the dedicated metric store, and I know that's an area that's been seeing a lot of activity and attention lately. I'm curious what your brief take is on the state of that ecosystem in terms of the Transform folks and the like.
[00:03:17] Unknown:
I'm always proud of being a founding engineer working on it; we built it five or six years ago at LinkedIn. The whole thing started with data democratization. Right? Like, hey, how do you enable data consumers, like data scientists and data analysts, to directly define and create new datasets or metrics? So it depends on the view. A lot of companies approach this problem by defining YAML files or some kind of JSON configurations for defining metrics, and translating them into actual data pipelines to generate those on a regular cadence. This is essentially what LinkedIn did as part of the Unified Metrics Platform, which we worked on for quite a long time. An interesting aspect you mentioned about the metric store, for us, is dbt. You must have been following it recently; dbt also went into the creation of metrics, which is a very natural path in my opinion.
Now you have defined your pipelines and configurations through dbt. The next step is to define and specify a particular view, which is really a metric on top of these tables.
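To make that config-driven metrics idea a bit more concrete, here is a minimal, hypothetical Python sketch of turning a declarative metric definition into a SQL query. The field names and the "weekly_signups" metric are invented for illustration; this is not LinkedIn's UMP or dbt's actual metric spec.

```python
# Hypothetical sketch: turn a declarative metric definition (the kind of
# YAML/JSON config described above) into a SQL query. All field names and
# the "weekly_signups" metric are invented for illustration only.
from textwrap import dedent

metric = {
    "name": "weekly_signups",
    "table": "analytics.user_events",
    "aggregation": "COUNT(DISTINCT user_id)",
    "filters": ["event_type = 'signup'"],
    "time_grain": "week",
    "timestamp_column": "event_ts",
}

def metric_to_sql(m: dict) -> str:
    """Render a simple aggregate-over-time query from a metric definition."""
    where = " AND ".join(m["filters"]) or "TRUE"
    return dedent(f"""
        SELECT DATE_TRUNC('{m['time_grain']}', {m['timestamp_column']}) AS period,
               {m['aggregation']} AS {m['name']}
        FROM {m['table']}
        WHERE {where}
        GROUP BY 1
        ORDER BY 1
    """).strip()

print(metric_to_sql(metric))
```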
[00:04:20] Unknown:
And so for folks who want to learn more about how you both got into data and some of your background there, I'll refer them back to the previous interview that you were both on, and I'll add links to the show notes for that. But that brings us to what you're building now at Metaphor. I'm wondering if you can just start by giving a bit of an overview of what it is that you are building there, some of the story behind how you decided to turn that into a business, and some of the motivation that keeps you interested and excited and spending your time and effort on it. So as we kind of alluded to previously,
[00:04:51] Unknown:
there's definitely a big LinkedIn heritage story going on here, but I think it's very relevant, so I'll briefly talk about it as well. When I first joined LinkedIn, I was leading the metadata team. This was a very new team at the time, actually. And the problem they were trying to tackle was mostly: how do we free up the data engineering team to work on things that are more data engineering oriented, rather than trying to handhold their stakeholders, essentially data scientists, business analysts, and whatnot, to find the things that they need.
So that was kind of the initial search and discovery bent of the metadata. But by the time I joined, the focus quickly shifted onto GDPR, obviously, because GDPR at the time was a huge deal for all these consumer Internet companies. What sort of triggered that was: hey, your team is the only team that actually has visibility into what LinkedIn has in terms of the data ecosystem. Naturally, that's the very first step toward compliance, because you kind of need to know what you have before you decide what to do with it. So the team started building out the entire infrastructure for enforcing GDPR and compliance related issues. And that is when it hit us: hey, look, the stuff that we gather for search and discovery, it turns out a big chunk of it is very, very relevant for compliance.
And then the light bulb went on in our heads and we said, hey, maybe we should build this as a platform rather than building it as a point solution for individual use cases. We literally started from the ground up building a good metadata platform. We often joke about that: we built a kind of crappy product that sits on top of a very solid metadata platform, because there's a huge use case of, hey, we have pipelines and systems that need to grab the metadata and take actions in real time, so you need that ability to scale your platform. And our intuition was actually right: after GDPR, many other use cases started popping up along similar lines. Hey, you guys have 80% of the metadata, can we just bring in an additional 20%, and then our problem will be solved? So quickly, it started from, like I said, search and discovery, moved into GDPR, and then started moving into data governance.
This is actually one thing that Pardhu didn't mention when he talked about the Unified Metrics Platform. It was so successful in a short period of time, having, I forget the exact number, 30,000 or 60,000 metrics? 30,000. 30,000 metrics. Right? There's no way a company should have 30,000 metrics. The reason behind that is that there wasn't any governance around it. So quickly you end up having duplicated metrics, or metrics that are significantly similar to each other but not leveraging each other at all in any sense. So the platform quickly got into this data governance piece as well. How do we make sure that we have a way of growing our data ecosystem, allowing people to freely self serve and democratize, without completely organic growth that has no control whatsoever?
So data governance is kind of the next area we got into. And in parallel, we also got into the whole AI/ML DevOps area. LinkedIn at the time was pushing an initiative called Pro-ML, which is productive machine learning. They wanted all these infra pieces that help machine learning practitioners be very productive in their machine learning process, and metadata was a big part of it as well. They need to be able to find the features they're looking for. They need to find the model they're looking for. They need to know what state it's in. They need to know which ones are good and which ones are not, which ones are deployed to prod, which ones were rolled back, etcetera, etcetera. The same platform actually also powered that particular use case. So at that time we realized, hey, look, this thing that we built is actually very interesting, because it kind of became the center of the universe, where a lot of other use cases can just branch out from there. And of course, at the time, we also open sourced the platform as DataHub and gained significant traction. In fact, I believe it is still the number one open source project out there in this area.
So that is when we started realizing, hey, this thing that we built probably has a lot of value outside of LinkedIn. Even though we built it specifically for LinkedIn, we could see that other companies also have similar problems. And so that's when we decided, hey, we should come out and solve this problem for other companies as well. And we feel like we're in a unique position, because we've been through all these things and seen what worked and what didn't, so we're really in a good position to build a product that can solve the problem for many other companies. So that's kind of the genesis of the company. That's why we decided to leave LinkedIn and start on our own.
[00:09:30] Unknown:
One of the things that's notable on the landing page when you first go to the Metaphor Data website is that it claims you're trying to be the system of record for the data platform, and that differs a little bit from the data catalog nomenclature that a lot of people have been hearing about recently. I'm wondering if you can just unpack that statement, how you settled on that particular phrasing, what that means in terms of the actual capabilities of the platform, and how it might differ from how people think about data catalogs as a product category?
[00:10:03] Unknown:
So this is very interesting. Right? If you observe, over the last decade, even earlier, a lot of the industry's work has gone into breaking the silos in data: pulling all the data into data lakes or data warehouses and enabling multiple kinds of analytics and predictions and everything on top of that. A similar trend is what we have observed with metadata. There are a lot of silos within the metadata between these ecosystems, and there is a lot of value we've noticed we can create by breaking the silos between those data systems.
Without that, we have observed that all the data applications try to interact and grab metadata from each other, creating N-squared connections between each other to share and exchange that metadata. And with that come a lot of problems: hey, did both of us read a similar set of metadata, and at what point in time? How did it evolve? Everyone has to solve a similar set of problems around consistency or version control within these metadata exchanges as well. So we saw a good parallel between the actual data ecosystem problems and the metadata ecosystem problems. And what we envision by calling it a system of record is essentially a metadata platform which can group all this metadata into a single place and provide a consistent view of the metadata for both people consumption as well as system consumption.
And you are absolutely right. A lot of our competition and a lot of the existing products view it and name it as a data catalog, which we believe is just one of the applications for this aggregated metadata. Traditionally, it also came from the notion that data catalogs are a solution for data discovery, or, to be more specific, that technical data catalogs are a solution for data discovery, which we believed in as well for a long time. To be honest, it was not sufficient and did not work well. That's why we strongly believe that being the system of record, by being a good metadata platform and aggregating all of this, solves the problem holistically across multiple use cases without requiring us to build very point-to-point solutions. The data catalog is one point solution; the next time you want to handle metadata change management, you build a different application, or if you want to solve resource optimization, you build a different application. A lot of these things have good overlap, which you can leverage through a metadata platform.
Hope that makes sense. Yeah. Definitely.
[00:12:48] Unknown:
And it also puts me in mind of the work that is happening at the OpenMetadata project of trying to be this universal substrate for metadata in a data ecosystem, where the metadata is just the foundational building block for all of these different views and applications to be built on top of it, of which the data catalog is just one, and others are things like data quality, data governance, lineage tracking, things like that. I'm wondering if you have spent any time looking at the work that they're doing there as far as their approach to metadata as this universal substrate, any lessons that you have been able to learn from it, or maybe some points of difference in terms of the philosophy of how you think about metadata as the substrate versus how they're approaching it with this very well defined, schematized model?
[00:13:37] Unknown:
I think the goal is very similar. And, in fact, I believe the OpenMetadata team has a wealth of experience in data as well, and that's why they saw this coming, so to speak. They didn't choose to build a point solution; they chose to build a platform. We kind of joke about how OpenMetadata is a better version of DataHub to some extent, mostly because DataHub is subject to some of the technology choices that LinkedIn placed on it, whereas OpenMetadata, being a brand new thing, had a lot of freedom in choosing what to do there. So, yeah, I think the ideas are very, very similar. In terms of metadata modeling, in fact, the two, including us as well, are all very similar in a sense. The idea is you want to be strongly schematized, but not to the extent where you essentially become the least common denominator.
Because if you think about it, obviously the most extendable data model would be a dictionary. You can just put key values in there and extend it all day long. But, of course, that will be completely useless when someone is using it, because you're basically saying, I'm essentially schemaless, you go figure out what's in there. So the user is going to have a bad day using your completely extendable metadata. On the other extreme, if you are really trying to be very, very specific about things, you say, hey, look, I have a very strong opinion about this and that, which is great for your users because they know exactly what to expect. But then you're not going to be able to capture all the rich metadata, because systems are bound to differ, and if you just try to slot everything in, forcing square pegs into round holes every time, you end up with the least common denominator in that sense. So how do you have a strongly schematized system that is, at the same time, very extendable? That in and of itself is a very interesting engineering challenge. It's a challenge in the sense that the entire stack has to go with this model and evolve with it as it evolves, not just your storage and your API layers; even your UI kind of has to automatically update as you update your model. That itself is an interesting challenge. It's the challenge we had with DataHub, that we were trying to solve inside of LinkedIn to some extent, and we are still solving it today. I don't think anyone has a perfect solution yet. But there were a couple of good choices made there in OpenMetadata versus DataHub versus what we're doing.
For example, DataHub uses Pegasus, which is a very weird language developed by LinkedIn, and, even though it's open source, it's not super popular outside of LinkedIn. OpenMetadata chose JSON Schema, which is a very well understood, standardized schema language that also has a strong connection to OpenAPI. And that is also the path that we've chosen. We want a language that people are fairly familiar with, that has a lot of strong support in terms of community and tooling and all that, and that we can evolve based off of.
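As a rough illustration of the "strongly schematized but extensible" idea Mars describes, here is a minimal Python sketch using the `jsonschema` library. The entity shape and field names are invented for this example; they are not Metaphor's or OpenMetadata's actual model.

```python
# Illustrative only: a dataset entity schema with a fixed, well-typed core plus
# an open-ended "customProperties" bag for extensions. Field names are invented;
# this is not Metaphor's or OpenMetadata's actual model.
from jsonschema import validate  # pip install jsonschema

DATASET_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "platform": {"type": "string"},  # e.g. "snowflake", "s3"
        "schema": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "dataType": {"type": "string"},
                },
                "required": ["name", "dataType"],
            },
        },
        # Escape hatch: extensions live here instead of loosening the core model.
        "customProperties": {"type": "object"},
    },
    "required": ["id", "platform"],
    "additionalProperties": False,
}

entity = {
    "id": "snowflake://analytics.public.orders",
    "platform": "snowflake",
    "schema": [{"name": "order_id", "dataType": "NUMBER"}],
    "customProperties": {"costCenter": "growth"},  # extension, not core schema
}

validate(instance=entity, schema=DATASET_SCHEMA)  # raises ValidationError if malformed
print("entity is valid")
```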
[00:16:36] Unknown:
In terms of the target users for Metaphor, where are you focusing your primary efforts as you're launching the product? And how does that inform the features that you're building and the user experiences that you're designing to be able to go to market with it and help provide value to people who are starting to onboard onto the Metaphor platform?
[00:16:57] Unknown:
So this is a very interesting aspect. A question that often comes up is: hey, are we building it for the data engineering team, or are we building a tool for the rest of the company, the business analysts or the business users? The answer is both. There is so much overlap in the requirements and in what use cases can be solved for both groups that it almost warrants a common tool which can solve them in a single place for both sets of users. We see it as three different personas, to be more specific. One is the data producer, the data engineering or analytics engineering persona: the people who are actually creating and maintaining the pipelines, or creating very specific core assets for the rest of the company to consume.
One of the most common goals for these teams is data democratization: enabling data creation and consumption across the company. Data democratization is great, but it also kind of led to a lot of data sprawl. Often we end up with very small companies with thousands of dashboards and thousands of tables. Imagine: you have a lot of resources spread across these dashboards and assets, and any migration or change they have to make is a nightmare to manage. For the small data producer team, the data engineering team, supporting a huge company with a one-to-hundreds ratio of customers is very daunting.
Being this common central team for data, they often also become the bottleneck, the common resource for any data related questions. Sometimes they do have the knowledge; sometimes they're not even the right people to answer those questions. So they end up routing people: hey, maybe this was created by that particular group, can you go talk to them about why they did it, why they pulled this particular column, things like that. This is essentially a lot of support work, being the data team and enabling the data stack, rather than solving core engineering problems around data, which is the actual main aspect of their job. So this is one persona which we try to help and solve for through Metaphor.
The second persona is the data consumer, if you see the same problem from the other side. Hey, the company is moving extremely fast on data adoption and data democratization, and I'm able to create, but I also end up with a lot of questions around trusting the data. I see too many copies, or I see too much similar data: which is the right one to use, and when should I use what? Sometimes the answer could be multiple; sometimes the answer could be not the popular choice and very specific to your use case. So how do you bring that knowledge, and how do you get those questions answered, without what we call the tap on the shoulder: asking too many people, hey, is this the right one, or which one should I use, through Slack or through emails and things? That is the second set of problems which we see from the data consumer's perspective. The data consumer personas are the typical business analyst and data analyst personas.
The third and interesting persona is actually data leadership. So, yeah, you have a team who is enabling data democratization, and you have stakeholders across the company who are creating datasets and such, but often these too many datasets, metrics, and dashboards create a lot of inconsistencies, which is the eventual problem with organic growth and scale. You have inconsistent answers coming from multiple places, or you don't know if you are getting the right ROI out of your data resources. ROI doesn't mean just the cost of, like, the Snowflake bill or anything else. Even in terms of resources: for any migration, is it wise for my team to even spend time migrating this one when there are actually no end users of the dashboards? There are dashboards which are pulling the data, but no one is actually using them, so what's the impact? So helping them understand the right ROI, and reducing the opportunity cost of maintaining all this unnecessary data across the system: this all ties into a lot of productivity for your data team and your company as well. These are the three personas which we focus on as part of Metaphor, and especially the teams which are very much modern data stack users are our main focus for Metaphor right now.
[00:21:28] Unknown:
Digging into the actual technical implementation of what you're building at Metaphor, I'm wondering if you can talk to some of the architecture, how you approached the design and implementation of it, and some of the...
[00:21:45] Unknown:
...open metadata conversation there a little bit. But in a nutshell, the architecture actually largely resembles what we did at LinkedIn with DataHub, but it's completely written from the ground up. In fact, it's not even written in the same language as the original, so that's how different it is. But like I mentioned before, our goal is to build a platform on top of which you can build additional stuff. Like any good platform, there will be a bunch of requirements against it. In fact, by the time this podcast gets published, we will probably have published our technical blog post that talks about this in greater detail, about our architecture, our platform piece, and how that was built, so I'll redirect your audience to that to read more. But here, I'll just roughly touch on a couple of points which are important for a metadata platform.
First of all, like any platform, it has to be scalable. The reason is that a lot of people overlook the scale, thinking, hey, metadata is small, right? I can easily put that into Postgres or something like that. Yes, if you're looking at a subset of the metadata. When you're actually looking at it holistically, all the metadata generated in your ecosystem adds up very, very quickly. Imagine you're trying to capture every version of a schema, every column in your tables, every lineage edge, every change made to the system, every job that runs, and so on. Very quickly, that sort of thing builds up, so your system has to be scalable in order to cope with that growth. And actually, one fundamental decision we made fairly early on is that we want every piece of metadata captured to be immutable, in the sense that everything that comes into the system becomes a version of itself. You don't go back and change the history per se, which means we have to actually retain everything over time. Of course, we are smart about not retaining versions that are identical; we make sure that only real changes get captured, but it can still add up very, very quickly. So when you have that sort of scale, your system needs to be capable of handling it. It can't just be, as often happens when people build point solutions, choosing the technology that's most appropriate for that one thing. If you're thinking only about a data catalog, sure, Postgres will probably do just fine. But when you're thinking about something that needs to enable all sorts of different applications on top, scalability becomes a big issue.
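As an aside, here is a minimal Python sketch of the immutable, versioned capture described above: every write becomes a new version, and byte-identical payloads are skipped so only real changes are retained. This is an in-memory toy under assumed semantics, not Metaphor's actual storage layer.

```python
# A minimal sketch of "immutable, versioned metadata": every write appends a new
# version, and identical payloads are deduplicated so only real changes are kept.
# In-memory only; a real platform would back this with a scalable store.
import hashlib
import json
import time
from collections import defaultdict
from typing import Optional

class VersionedMetadataStore:
    def __init__(self):
        self._versions = defaultdict(list)  # entity_id -> list of {hash, payload, ts}

    def write(self, entity_id: str, payload: dict) -> bool:
        """Append a new immutable version; return False if identical to the latest."""
        digest = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        history = self._versions[entity_id]
        if history and history[-1]["hash"] == digest:
            return False  # no real change, don't retain a duplicate version
        history.append({"hash": digest, "payload": payload, "ts": time.time()})
        return True

    def latest(self, entity_id: str) -> Optional[dict]:
        history = self._versions[entity_id]
        return history[-1]["payload"] if history else None

store = VersionedMetadataStore()
store.write("dataset:orders", {"columns": ["id", "amount"]})              # True, first version
store.write("dataset:orders", {"columns": ["id", "amount"]})              # False, identical
store.write("dataset:orders", {"columns": ["id", "amount", "currency"]})  # True, real change
```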
And then the second thing is reliability; it's kind of a given that any platform needs to be reliable. But the reason it needs to be reliable is that if you only have humans accessing your metadata, sure, it can go down; it's only being used nine to five during the weekday, and you don't need four nines, five nines, ten nines of uptime. But as soon as you have systems depending on this metadata, as was the case at LinkedIn, where we literally have pipelines that wake up every night, read the metadata, and delete stuff for GDPR compliance and whatnot, you need a very reliable system, because if you don't, some other system will go down with you. So that's the second point: it has to be very reliable. The third point is extensibility, which we sort of touched on. This is also a little bit different from a traditional platform. A traditional platform would say, hey, look, I have my API, my API is kind of set in stone, and if I need to make any changes, it's going to be v2 of the API, and then a v3 API after that.
Here, it's actually very different, because the actual metadata we are trying to capture evolves, and we are trying to capture as rich metadata as possible. So the API is almost constantly in flux in that sense. How do you make sure it's extendable in that way without breaking your clients, so to speak? That's another challenge from a platform perspective. And then finally, first class APIs. This is also super important; we talked about it a little bit. You kind of expect, yes, as a platform, you need to have a RESTful API. That's almost a given today. But for a metadata platform, you need more than just the RESTful API.
First of all, metadata is very, very powerful because it's connected. A lot of things are related to a lot of things. Lineage would be one of those variations, where assets are connected to other assets and so on. But if you think beyond that, how are people related to these assets, and how is additional stuff, for example the jobs that ran and created them, related? All of these things form a complex metadata graph. And a lot of the time, when you're trying to answer your specific use cases and questions, you're essentially just traversing this graph. A traditional RESTful API is not very ideal for that sort of traversal. You want a more graph-like API that allows you to do that, so maybe GraphQL, or maybe some other graph language that is more suitable for that sort of graph-oriented query.
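To illustrate why a graph-shaped API fits this kind of traversal, here is a hypothetical Python snippet issuing a GraphQL-style lineage query. The endpoint URL and the query fields (`dataset`, `owners`, `upstreamLineage`) are invented for illustration and are not Metaphor's actual API.

```python
# Hypothetical example: traversing lineage through a GraphQL-style metadata API.
# The endpoint URL and the query fields are invented for illustration only.
import requests

QUERY = """
query Lineage($id: String!) {
  dataset(id: $id) {
    name
    owners { name }
    upstreamLineage {
      dataset { name }
      job { name lastRunStatus }
    }
  }
}
"""

resp = requests.post(
    "https://metadata.example.com/graphql",  # placeholder endpoint
    json={"query": QUERY, "variables": {"id": "snowflake://analytics.public.orders"}},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```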
So you need that sort of graph-oriented API. On top of that, a lot of the time you want to enable use cases that are reactive: okay, something changed, some system should wake up and do something. For example, if all of a sudden a dataset drops a hundred thousand columns, something should happen; maybe an alert should go off. Or maybe this thing fell out of compliance, someone accidentally got access to something they're not supposed to, and something should happen. For that sort of thing, you don't want the client to busy-poll your API trying to see whether something changed. You want to get notified, essentially. So a push based system, a push based API, is kind of expected of a metadata platform as well. And finally, there's also an analytical API.
Normally, that just means: hey, I want to make sure this data is available to run complex analytical queries en masse. Give me all the datasets that fit criteria A, B, C, that were used in the past three days and used by this group of people, etcetera; you can think of those sorts of very complicated analytical use cases. You don't necessarily want to run that against your transactional database; that's why data warehouses evolved in the first place. Similarly, you want to be able to open up those APIs and allow people to run that sort of query against the metadata. So that is, in a nutshell, the platform and the unique requirements you place on it.
In terms of architecture, we essentially built this whole thing based on the best the cloud could offer, which was one of the main reasons we rewrote the whole thing, because there's a lot of great cloud technology out there, and you really don't need to invent your own globally distributed key value store and all that sort of stuff. It's already done, and you can just leverage it and build a much more reliable system that way. So, yeah, that's kind of how it is. Hopefully that covers it. But once again, hopefully by the time you get to listen to this podcast, the blog post is out. So read the blog post for more details; hopefully it does a much better job of explaining things than I did here.
[00:28:32] Unknown:
In terms of the engineering work of actually getting this off the ground, you mentioned that you started the business about a year ago, and as everyone knows, that's in the middle of the pandemic, so I'm sure that added some additional challenges. But in terms of actually getting things up and running, what have been some of the most complex or complicated engineering aspects of building this all out from scratch, some of the new edge cases that you discovered in the brand new implementation, and some of the new features that you wanted to bring in to be able to support the customers of the business?
[00:29:07] Unknown:
Sure. I'll talk a little bit on the engineering side of things, and I'll let Pardhu talk a little bit on the business side as well. So we started out with a team of, initially, the core developers, the initial founding engineers of DataHub. These are the folks that have done this for so many years; they've seen how the system evolved and all that, so they have a fairly good idea about what needs to be done. But, yes, there is a bit of a learning curve for us coming from LinkedIn, which is not really a cloud native company, into this cloud native space and trying to build things based on what the cloud could offer. But in terms of what needs to be done, it's actually reasonably straightforward, and we knew exactly what problems we hit at LinkedIn with DataHub. We joke that we know where the skeletons are in the closet. So it's actually exciting in that sense. We're like, hey, look, we knew we'd done something great, but we could definitely do better because we know where the problems are, so we'll go and address them. So that's kind of the initial engineering challenge. And, of course, doing this in a pandemic, there's the team aspect of how you build a globally distributed team in this sort of challenging environment. I think that applies to everybody, not just us. Yeah, beyond building the team, which is a totally remote situation,
[00:30:20] Unknown:
which we have not experienced before, that's a common problem. But there were also interesting aspects around the product and the expectations from the market around data catalogs. One thing that was a different assumption for us was we thought we would need to do a lot of education around why such a problem exists and why you need a solution for it. But the market rather surprised us. There has been so much inbound interest in data catalogs and data discovery, and to be more specific, I think it's because all the companies which have adopted dbt or Looker and Snowflake and Databricks, that data democratization, has really pushed the problem down to much smaller companies as well. That was a pleasant surprise for us. Even a company with a hundred people and a very small data team has challenges similar to a five hundred or a thousand person company, which was our original assumption.
That's an interesting change in dynamics in terms of the go-to-market and things like that. Yeah.
[00:31:22] Unknown:
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end to end fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes.
Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo swag box. And so another interesting element of the ecosystem that you're building this in and around is the recent huge amount of focus that has been put on the idea of the modern data stack, and what that might actually mean in terms of technology choice or design philosophy or technical capabilities. I'm wondering how that general focus on the modern data stack as the preferred way of building out a data platform has influenced the way that you think about building Metaphor, how you think about the way that Metaphor works in situations where maybe there isn't this monolithic cloud data warehouse that the entire organization uses as a focal point for all of their data flows, and some of the expectations that people have going into using Metaphor, whether or not they are actually using this sort of canonical stack, whatever form that might take?
[00:33:11] Unknown:
It's actually a very interesting question that you brought up. The whole modern data stack, of course, the definition can be a little bit blurry, but let's assume it means: hey, look, we want to take advantage of all these best in breed products that solve our data problems, instead of relying on a monolithic solution offered by one vendor that, yes, takes you end to end, but locks you in, and also doesn't give you best in breed on everything. There are now a lot of SaaS services, cloud data warehouses obviously being one of the main ones out there, that allow you, as a one or two person data team, to quickly bring up a data stack that reaches the same sort of sophistication as maybe the biggest enterprises out there, with a few button clicks.
I think that in and of itself is wonderful. And then, as Pardhu mentioned previously, we thought, hey, look, these data problems might only affect the bigger companies out there. The reason we started seeing them in smaller companies was mostly because of the modern data stack, because now people are being empowered, given superpowers, if you will, where these things used to belong only to the large enterprises and the biggest companies out there. So I think to a great extent this trend will probably carry forward. Obviously, vendors will want to dominate and monopolize everything up and down the stack, but the reality is people probably prefer best in breed that is easy to get their hands on. But then the problem there is, first of all, the interoperation between these systems, because vendors generally don't have a strong interest in trying to work with each other, especially if they're competitors.
So the interoperation between different systems is going to be a constant challenge. And second of all, as we touched on quite a bit before, is this whole governance piece as well. Yes, you give people superpowers, and guess what, they're going to do a lot of things; that's what's going to happen. You make it super easy to create things, literally free to create, and guess what, people will just keep creating more. So that is definitely where we see Metaphor coming in to help, and I think we're in a unique position to do so. First of all, we give you visibility across your entire landscape, across these multiple SaaS systems and best in breed products that you're using; we give you that holistic view, because let's face it, nobody else will do it other than a company that specializes in this area.
So we'll give you that view, but not just that; we'll also help you make sure that you grow your data ecosystem in a scalable way, not just in how much more data you can put into it, but in how you make sure it's maintainable over the long term. So we do feel like that is the true essence of the modern data stack: you give people the power, you truly democratize, and at the same time you make sure there are guardrails in place so people don't fall off the cliff. We serve as that guardrail to make sure that people don't just shoot themselves in the foot by going all in on organic growth.
[00:36:17] Unknown:
I think one of the interesting elements of the idea of a modern data stack is that everyone assumes it means you're using Snowflake or Redshift or what have you, and so all of the data that you're working with is this tabular, textual data. But as we all know, if you're doing anything beyond the simplest level of data workflows, you're also going to be dealing with unstructured data, or data that lives in object stores, or machine learning models. I'm wondering if you can talk to some of the ways that Metaphor, and the underlying data model that it allows for, gives you the opportunity to have this more holistic view of data that lives beyond the bounds of that single data warehouse, and the other types of data assets that you might be working
[00:37:02] Unknown:
with? Absolutely. I think you're touching on something really important as well. So when we talk about how we model things, we're trying to be as generic as possible, because who knows, tomorrow there might be a new system X, Y, Z invented that has very different characteristics from what you have today. But we have these high level concepts that are portable across systems. For example, whether it's a table or maybe a Parquet file that lives in your data lake, we view them all as a, quote unquote, dataset. That's kind of a homogeneous concept of, hey, it's highly related data, generally with the same schema, sometimes it doesn't even have to be, but conceptually it's the same schema for a particular bunch of data that is considered a dataset.
But beyond datasets, you're absolutely right: the value of having a metadata platform is to have all these other entities that are related to the dataset, which is, sure, kind of the center of the universe from a data science perspective, but there are many other things, and many other people use them. Of course, it's going to be dashboards. Of course, it's going to be machine learning models. Of course, it's going to be features. Of course, it's going to be higher level concepts, for example in dbt. What is a dbt model anyway? Probably it's a table, but it doesn't have to be; you can actually choose whether you want to materialize it into a table or not. So that sort of stuff has to all be captured in this entity relationship graph, if you will, so that when people start looking at it, they can truly understand how things are related. And that, once again, echoes back to what we mentioned before: when you go down that path, every entity in your entity graph will have a bunch of different kinds of very rich metadata attributes. Depending on whether it's a Redshift table, it might have a partition key or whatever, and if it's unstructured data, there's going to be a whole bunch of other stuff related to that as well. So you need to have the basic anchoring entities in your system and say, hey, look, that's how we conceptually think about these things, but that doesn't mean a thing can only take one form. It can take a lot of different forms, but conceptually they are all considered a dataset, just different versions of a dataset. It's almost like an object oriented way of thinking, but less rigid. It's not, okay, you have to have this class and that class and extend that. It's more like the famous saying in software engineering, composition over inheritance: you want to be able to do composition rather than a strict inheritance sort of system. So that's how we look at it. And in fact, we actually model people as an entity in there as well, because we really think people are a very critical part of the system.
Who's using what, who produced what, who's a subject matter expert on what, who's in charge of this, and all that. All that sort of information, you have to have it as part of your ecosystem, because guess what, when people are dealing with a thing, they want to talk to a person. They don't want to talk to a machine. They want to be able to find the right person to talk to and ask for help and whatnot. So people, absolutely, are another entity in the graph as well.
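As a rough sketch of the composition-over-inheritance entity model described above, here every entity (dataset, dashboard, person, and so on) is just an ID plus a bag of typed aspects, rather than a deep class hierarchy. The aspect names are invented for illustration; this is not Metaphor's actual model.

```python
# Rough sketch of an entity graph built by composition: an entity is an ID, a
# type, and a set of aspects, instead of a rigid inheritance hierarchy.
# Aspect names and URN formats here are illustrative only.
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class Entity:
    id: str           # e.g. "dataset:snowflake://analytics.orders"
    entity_type: str  # "dataset", "dashboard", "person", "ml_model", ...
    aspects: Dict[str, Any] = field(default_factory=dict)

    def with_aspect(self, name: str, value: Any) -> "Entity":
        self.aspects[name] = value
        return self

orders = (
    Entity("dataset:snowflake://analytics.orders", "dataset")
    .with_aspect("schema", [{"name": "order_id", "type": "NUMBER"}])
    .with_aspect("ownership", {"owners": ["person:alice"]})
)

# People are first-class entities in the graph, not an afterthought.
alice = Entity("person:alice", "person").with_aspect(
    "profile", {"team": "data-platform", "slack": "@alice"}
)

# A Parquet file in a data lake composes different aspects onto the same
# "dataset" concept instead of needing a new subclass.
events = Entity("dataset:s3://lake/events/", "dataset").with_aspect(
    "storage", {"format": "parquet", "partitionedBy": ["dt"]}
)
```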
[00:40:05] Unknown:
In terms of the actual people aspects of the metadata system, one of the interesting layers that you've built into Metaphor is this integration with the communication systems that an organization might use, where you can either query your metadata graph from Slack or pull in conversations that have happened in these different platforms and attach them to a particular set of data assets. I'm wondering if you can talk to some of the capabilities that you're building there, some of the types of workflows that you envisioned as you were building out those capabilities, and some of the details around how that manifests?
[00:40:36] Unknown:
It's a very interesting topic. One of our customers famously said that the best data catalog is the one you really don't have to go to. If you think about what that means: the whole essence of a data catalog today is a web UI, and you need to go there to start your data workflow, to understand which data to use, or even for search and discovery or any other use case. A lot of our competition do it that way, and in our past we have taken a similar approach. But as we started thinking more and more about metadata as a platform, and about metadata applications embedded in all data products, we focused a lot more on embedded workflows.
So what does this mean? Take the example of Slack, which is one of the most common communication channels across our customers and can stand in for other similar tools as well. I would argue that Slack these days is actually an essential data product. The reason is that a lot of the conversations around data between all the folks in the company actually happen in Slack. A lot of decisions and information get exchanged, and a lot of knowledge around data gets built up in Slack. That's why we believe our users should be able to consume, create, and interact with the data within Slack itself as an embedded workflow, rather than explicitly logging into our web UI or interacting with our web UI directly.
An example would be: you are conversing with someone on the data team, or a colleague, about a dataset, and through the Metaphor Slack app you can search for any data assets within Slack and share them right away there. Conversely, when you think you have exchanged some important information about a dataset, or a dashboard or other asset, with the click of a button you can push that knowledge to your data catalog, your data platform, so that it's available to the rest of the people and the rest of the ecosystem as well. So these are some cool new features which we've built through a powerful Slack app, not just simple notifications and tagging of conversations, etcetera.
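For a flavor of what an embedded Slack workflow could look like, here is a heavily simplified, hypothetical sketch using the slack_bolt library. The "/find-data" command, the search_catalog helper, and the environment variable names are all invented for illustration; this is not Metaphor's Slack app.

```python
# Hypothetical sketch of an embedded catalog search in Slack using slack_bolt.
# The "/find-data" command and search_catalog() helper are invented; consult
# the actual product docs for the real integration.
import os
from slack_bolt import App  # pip install slack_bolt

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

def search_catalog(query: str) -> list:
    """Placeholder for a call to a metadata platform's search API."""
    return [{"name": "analytics.orders", "url": "https://catalog.example.com/orders"}]

@app.command("/find-data")
def handle_find_data(ack, respond, command):
    ack()  # acknowledge the slash command within Slack's response window
    results = search_catalog(command["text"])
    lines = [f"<{r['url']}|{r['name']}>" for r in results] or ["No matches found."]
    respond("\n".join(lines))

if __name__ == "__main__":
    app.start(port=3000)  # serves the command over HTTP
```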
[00:42:45] Unknown:
In terms of the overall workflow of actually onboarding onto Metaphor, integrating the communication systems and the data systems, and populating the data lineage graph, I'm wondering if you can talk about some of the ways that you have designed the product to simplify that, any specific client libraries that you've built to ease the integration process at the technical layer, and the overall workflow of deciding Metaphor is the platform that I want to use: now I've got all of my metadata living in Metaphor, I have constant updates feeding in, and I want to build other applications on top of that metadata graph that I've populated.
[00:43:23] Unknown:
This is actually an area we've put a lot of thought into. Very early on, we realized the whole metadata thing is an integration game: the easier you make it to integrate, the more adoption you end up getting. So we really thought hard about how to make this as easy as possible. Sure, on the one extreme, you can have everything fully managed, which we do as well. All you have to do is enter your credentials, your Snowflake credential, your Looker credential, and whatnot, and we go connect to those systems and grab the metadata on your behalf. A lot of companies prefer it that way because they want to have as little to manage as possible; they want it fully managed. But as soon as you start talking to larger companies, they get a little bit nervous about that. There are definitely security considerations.
There's also the consideration of, hey, look, we have our own custom system, for example, or maybe this is a pipeline that we run internally, maybe it's our CI/CD that is doing these things. So you cannot fully manage this for us; it doesn't make any sense for you to fully manage this stuff for us. How are we, as a company, able to just push the metadata to you, and then have you ingest it into your system and start surfacing it in your platform? So for that, yes, absolutely, like you mentioned before, we do provide a Python library and scripts that people can run against the systems that we support out of the box. But for systems that we do not support out of the box, it's also super easy for them to leverage the library to build something on their own. And what literally ends up happening at the end of the day is that they extract the metadata from the system, in whichever way they feel is easiest for them.
At the end of the day, they just write it to a file and put that file in an S3 bucket or GCS or something like that, and then we'll pick it up from the other side. It's as easy as that. Everything is automated, everything is essentially in real time: as soon as the file is written, we'll ingest it. We try to make it as easy as that. So even if there's a proprietary system that they developed in house
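Here is a minimal sketch of that "write a file, drop it in a bucket" integration path. The record shape, bucket name, and key are invented for illustration; the actual ingestion format would come from the platform's documentation.

```python
# A minimal sketch of the file-drop integration path described above. The record
# shape and the bucket/prefix are invented; consult the real ingestion docs for
# the expected format.
import json
import boto3  # pip install boto3

records = [
    {
        "entity": "dataset",
        "id": "inhouse://billing/invoices",
        "schema": [{"name": "invoice_id", "type": "string"}],
        "owners": ["data-eng@example.com"],
    }
]

path = "/tmp/metadata_export.json"
with open(path, "w") as f:
    json.dump(records, f)

# Upload to the agreed-upon bucket; the platform watches it and ingests the
# file shortly after the object lands.
boto3.client("s3").upload_file(path, "example-metadata-drop", "exports/metadata_export.json")
```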
[00:45:24] Unknown:
that we have no access to, they can still easily integrate with us. And so for people who are going down that path of integrating with Metaphor and building systems on top of it, what are some of the challenges that you've seen teams face in being able to stitch together a meaningful set of relationships across these metadata entities? Where I've populated the database schema for my application database, I also have the job IDs and task timings for my Airflow jobs that are landing in the data warehouse, and then I want to actually build a cohesive view that says, okay, this job is taking these source tables, landing them in this destination, and then triggering this dbt pipeline.
[00:46:06] Unknown:
Yeah, this is actually a super interesting area. We joke about how the hardest bit is not any of the code you have to write; the hardest bit is actually agreeing on the ID. You have to have a global ID, so that when you say, hey, this is the job, this is the table, this is the column, you have a way of referring to it and everybody agrees on it. Otherwise, things won't link up. It's as simple as that. The tricky bit is that these IDs are generally hierarchical in nature. If you think about a table in Redshift, you can say, sure, I'll start with the table name. But then you probably need to include the schema and the database as well to make it globally unique. Okay, fine, add those. But then, what if you have more than one Redshift cluster? Then you have to add that too. And what if you deploy to multiple regions? You can see there are multiple layers of things that uniquely identify this entity. At the same time, you cannot make them all mandatory, because then the people who only have one cluster in one region will ask why they should fill out all this information that isn't necessary. So the challenge is coming up with an ID system that is extensible, so it can cope with the most complicated multi-region, multi-cluster scenarios, but at the same time doesn't burden the majority of people who don't have complicated deployments.
How do you design an ID system that works in both cases? That is a very interesting engineering challenge. And once you've built it, if your ID scheme ever changes, say I actually do add a second cluster now, how do you make sure the existing IDs can migrate? That in and of itself is another challenge. These are the engineering challenges we face every day. For better or worse, as the team providing the metadata platform, you have to put some rules in place for this sort of thing, but you want those rules to be flexible enough that when people do have different configurations, you don't force them to always be as fine-grained as they could possibly be, because that would be counterproductive in most cases.
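To illustrate the hierarchy idea, here is a minimal sketch of what such an extensible ID might look like, assuming a hypothetical `DatasetId` type with optional cluster and region levels; this is not Metaphor's actual ID scheme, just one way to let simple and complex deployments share a single identifier format.

```python
# Hypothetical hierarchical dataset ID (illustrative only, not Metaphor's real scheme).
# Mandatory levels (database/schema/table) always appear; optional levels (cluster,
# region) are included only when a deployment actually needs them, so simple setups
# aren't forced to fill in fields that don't apply.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetId:
    platform: str                    # e.g. "redshift"
    database: str
    schema: str
    table: str
    cluster: Optional[str] = None    # only needed with multiple clusters
    region: Optional[str] = None     # only needed with multi-region deployments

    def canonical(self) -> str:
        """Build a globally unique string, skipping levels that weren't provided."""
        parts = [self.platform, self.region, self.cluster,
                 self.database, self.schema, self.table]
        return ".".join(p.lower() for p in parts if p)


# A simple single-cluster setup and a multi-region one both resolve cleanly:
print(DatasetId("redshift", "analytics", "public", "orders").canonical())
print(DatasetId("redshift", "analytics", "public", "orders",
                cluster="prod-2", region="eu-west-1").canonical())
```

The design choice here is that optional levels only appear when they are needed, so a single-cluster shop never has to invent placeholder values, while a multi-region shop still gets globally unique IDs.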
[00:48:14] Unknown:
For people who are starting to use Metaphor, exploring its capabilities, and building new systems that tie into it, what are some of the features or potential use cases that are either overlooked or misunderstood, and what are some of the things that you'd like to draw particular attention to for people who are considering Metaphor for their use case?
[00:48:33] Unknown:
Another interesting question. I think we lightly touched on this earlier in the conversation. Not just with Metaphor, a common misconception around data catalogs is the assumption that pulling all the technical metadata together will solve the data discovery problem; that assumption is itself the problem. We have firsthand experience trying that, and there are limitations. It is definitely a step toward solving the problem, and it's a natural thought process: I cannot find the data because it isn't all in one place, so maybe it's a search relevance problem, so let me add as many signals as possible into one place. But it is, in fact, not a search relevance problem. It is really hard to rely on relevance alone to identify what you are looking for. That's why we believe the catalog, or to be specific the technical metadata catalog, is only part of the solution. You need a lot of different signals: what your organizational context is, how you use the data, what business context you are actually using the data for, and who is involved. All of these aspects come together to help you understand what the right data is for your use case, across all of the use cases we talked about. That's why we believe there are three components, technical metadata, business metadata, and the user and usage context, which need to be brought together to solve this problem holistically.
We did a blog post around this, "Why can't I find the right data?", which came out of all the frustrations we heard from our customers over the last few years and what we are essentially trying to solve with Metaphor.
[00:50:21] Unknown:
There's also a funny story I'll interject here, and in fact we hear multiple variations of it. People say, hey, how do I know I can trust the data? Guess what, I'm not looking at the data quality, I'm not looking at the lineage, I'm not even looking at the SQL query or anything like that. I'm looking at who's using it, who's depending on it, and whether I trust that person. It might sound silly, but it's very true, because a lot of the time you trust the person. If I know this dashboard is used by the CEO, I can trust it a lot, because I know someone will make sure it's up to date and accurate. Whereas if it's not used by anyone at all, or not used by anyone I trust, naturally you have less trust in it. That's part of how this whole behavioral metadata, as well as business metadata, comes into play. You cannot just throw all the technical stuff at people and expect them to magically digest it and come up with their own enlightenment. That's not how it works in real life.
[00:51:16] Unknown:
And as you have been building the business and working with customers and design partners, what are some of the surprising aspects of how metadata gets used, the level of general sophistication or awareness of its capabilities, or some of the potential adjacent use cases that you might start to explore for this metadata platform?
[00:51:39] Unknown:
I would say one of the surprising things operates on two levels. One is the scale of the companies. We always thought this was a problem for companies with a thousand-plus people, and we were only targeting and talking to those. But a lot of the inbound interest happened to come from very small companies, around a hundred people. When we took a deeper look, yes, they have a very similar set of problems, which is very interesting. Finding the data is one thing, but on the governance side there are a lot of advanced use cases, like using dbt or a similar tool to govern the creation of data assets.
But once there are downstream dependencies on those assets, can I enforce rules on top of what I create to make sure the consumption patterns stay valid? In the sense that if a dashboard depends on a column, then "you cannot drop this column" has to be part of your checks, like a check on git push, so that the consumption patterns are validated before you actually make the change. Those are some of the interesting use cases that popped up. Another is ROI, which we always thought of as a cost problem, a billing problem of huge warehouse bills, but it's not just that. The opportunity cost for data teams is huge, which is essentially the question of what the ROI on my data is.
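To sketch what that kind of consumption-aware check could look like, here is a minimal example of a CI guard that fails a change when it drops columns that downstream dashboards still reference; the `proposed_columns()` and `dashboard_columns()` helpers are hypothetical stand-ins for your own model parsing and metadata lookups, not a documented Metaphor or dbt API.

```python
# Hypothetical CI guard: block a schema change that would drop columns
# still referenced by downstream dashboards (illustrative stand-in logic).
import sys


def proposed_columns(model: str) -> set[str]:
    """Stand-in: columns of the model as defined in the branch under review."""
    return {"order_id", "order_date", "total_amount"}  # "customer_id" was dropped


def dashboard_columns(model: str) -> set[str]:
    """Stand-in: columns of this model referenced by dashboards, per the metadata graph."""
    return {"order_id", "customer_id", "total_amount"}


def check_model(model: str) -> int:
    """Return a non-zero exit code if the change breaks a downstream consumer."""
    missing = dashboard_columns(model) - proposed_columns(model)
    if missing:
        print(f"ERROR: {model} drops columns still used by dashboards: {sorted(missing)}")
        return 1
    print(f"OK: {model} keeps every column referenced downstream.")
    return 0


if __name__ == "__main__":
    sys.exit(check_model("analytics.orders"))
```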
[00:53:06] Unknown:
Recently, both Snowflake and Databricks have started pushing hard on their data marketplaces, and we do hear people starting to talk along those lines: hey, we're spending money buying this data, how do I know it actually delivers the ROI we wanted to achieve? If you don't know how the data is used, you cannot answer that question. Maybe there's a dashboard powered by this data, but is it an important dashboard or not? Does it actually lead to important business decisions? Those questions become very interesting. As soon as companies start spending a lot of money buying external data, they want to know whether they're getting the ROI they expect.
[00:53:47] Unknown:
Conversely, for a lot of companies this isn't only an internal question of how to enable data discovery within the company. If you think about it, do I need a separate data marketplace catalog, or can I just leverage the same catalog with an external view versus an internal view, and expose this data for availability or even try selling it to my customers? That's an interesting question coming to us from our customers as well.

As you have been working with people and building out the platform, what are some of the most interesting or unexpected or innovative ways that you've seen Metaphor used?

As part of our system, we actually stress knowledge capture to a great extent, specifically the data-relevant knowledge that we want people to capture.
[00:54:29] Unknown:
And these are actually being used in many different creative ways. Some people use it as an onboarding guide: welcome to the team, here are the three things we use in terms of data, here are the five dashboards you need to pay attention to, and by the way, here are the two people on the systems team who happen to know the most about this. That sort of thing is a good way to document knowledge in the context of the data, and our system happens to be very good at linking all of these things together in, for lack of a better word, a Notion-like document.
So with that ability, people start using it for onboarding guides. Of course, you can extend that to a kind of wiki page as well: here's our team's wiki page, here are the five things we mostly work on, and if you have questions about this, go and ask that team, and so on. There's definitely that piece as well. And finally, some people create what is essentially a personal favorites playlist: these are the five things I work on a lot, or when I do this project, these are the four things I use most, and then they share that, basically telling people, if you happen to work on this project, here are the things you should pay attention to. Those are the creative ways people are using the knowledge capture feature to surface that information. The more expected way, for example, is something we hear from our users: hey, I'm a data engineer, I'm in charge of this thing, and I need to deprecate a table or migrate it to something else. Yes, I can look at the lineage and see that there are a hundred things that depend on this particular table, but I don't know if any of them are really useful or critical. I can look at the access log and see a headless account accessing it every day, but so what? Someone set up a pipeline, so of course there's going to be a headless account accessing it.
How do I know if this is actually business critical? You kind of have to rely on people telling you; it's very hard to magically derive it. In fact, most of what we hear is that people just turn off access and see who cries. That's the typical approach: let me turn it off for a couple of days and see what breaks, and if nothing breaks, then maybe I can kill it off. This actually happened at LinkedIn as well: they turned something off for 30 days, nothing happened, nothing broke, all good. Then they turned it off for real, and it broke, because there was a quarterly report that only runs once a quarter. You can argue that if you extend the window to a year, maybe that's safe, but at that point the whole point of the migration is moot. So having the ability to know who's using something and for what purpose becomes super important when it comes to deprecation.
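As a rough illustration of how that who-and-why signal could feed a deprecation decision, here is a minimal sketch that combines lineage, access recency, and ownership before retiring a table; `downstream_assets()`, `last_accessed()`, and `owner_of()` are hypothetical lookups against whatever metadata store you have, not a real Metaphor API.

```python
# Hypothetical deprecation triage (illustrative; not an actual Metaphor API).
# Combine lineage, access recency, and ownership so the decision isn't just
# "turn it off and see who cries".
from datetime import datetime, timedelta, timezone


def downstream_assets(table: str) -> list[str]:
    """Stand-in: assets that read from this table, per the lineage graph."""
    return ["dashboard.quarterly_revenue", "pipeline.daily_orders_rollup"]


def last_accessed(asset: str) -> datetime:
    """Stand-in: most recent access time, per warehouse access logs."""
    return datetime(2021, 10, 1, tzinfo=timezone.utc)


def owner_of(asset: str) -> str:
    """Stand-in: owner recorded in the catalog, so you know who to ask."""
    return "finance-analytics@example.com"


def deprecation_report(table: str, stale_after_days: int = 90) -> None:
    """Print each downstream consumer, whether it looks stale, and who to confirm with."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=stale_after_days)
    for asset in downstream_assets(table):
        status = "stale" if last_accessed(asset) < cutoff else "ACTIVE"
        print(f"{table} -> {asset}: {status}, confirm with {owner_of(asset)}")


deprecation_report("analytics.orders")
```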
And of course you can always check with the person: hey, is this still true, are you still using this? If the answer is no, then maybe it's safe to deprecate. So there's that expected usage of knowledge capture, but then there's also a bunch of unexpected usage.

In terms of your own experience and lessons learned building Metaphor, what are some of the most interesting or unexpected or challenging lessons that you've come across?

I think we partly covered this already. The modern data stack brings a lot of these problems to smaller companies a lot quicker than before, and the underlying trend is that more and more companies are becoming data driven. In certain industries, e-commerce for example, if you're not data driven, you simply can't compete with your competitors. Because of that, a lot of the problems we thought of as, quote unquote, big company problems, because we worked at LinkedIn, and at that scale, yes, you absolutely have those problems, turned out not to be limited to big companies. When we came out and talked to people, we realized that a lot of these small companies gained their superpowers through the modern data stack, and now they're having all these problems too. That was a very surprising finding for us, because initially we thought we had to work with big enterprises, the banks, the insurance companies, in order to survive, because only those companies care about these things. It turns out that's completely not true; smaller companies actually care about these things a lot.
[00:58:31] Unknown:
For people who are trying to gain better visibility into their overall data ecosystem and the data platforms that they're building, what are some of the cases where Metaphor is the wrong choice and they might be better suited by an in-house system they've built, some other data catalog, or whatever other set of tooling they might decide to cobble together?
[00:58:51] Unknown:
I think we've established that company size is not really the indicator; it is really about your data maturity, or to be more specific, your stack maturity. If you have adopted the modern data stack, or are on the path to it, the earlier you engage with a modern metadata platform the more beneficial it is, rather than introducing it later in the game.
[00:59:11] Unknown:
If you are nowhere close, and we have also seen cases where companies are not cloud native or anywhere near a modern data stack, then maybe a traditional enterprise data catalog, an old-school catalog, could be the better choice. The bottom line is that we would strongly discourage anyone from building a catalog of their own. We've seen too many companies try it, because it might seem like an easy thing to do. Only a handful of companies are actually building cloud data warehouses, yet everybody seems to be building a catalog of their own. The reality is that it's a lot more complicated. Like I mentioned before, sure, you can easily do a point solution: throw up some Flask app that shows some static pages and call it a day. But your use cases will evolve very quickly, you'll realize it's not enough, and you'll pour more and more engineering hours into it. If you have plenty of engineering hours, sure, feel free. But most companies treasure their engineering hours dearly and would rather spend them on their core business.
So we would definitely discourage people from doing that. We've done it before: it's not fun, and you will probably end up spending more in the long term than you think you will.
[01:00:18] Unknown:
And as you continue to build out the Metaphor platform, work with customers, and grow the company, what are some of the things you have planned for the near to medium term, or any projects that you're particularly excited to dig into?
[01:00:30] Unknown:
Like we mentioned, the mission of the company is to build a metadata platform that can power multiple use cases rather than building point solutions. Data discovery and parts of governance happen to be the initial footholds toward that, and soon we are going to enable a lot more use cases around change management and resource optimization, which are exciting use cases in their own right. On a different dimension, some of our customers are also interested in extending these capabilities to other assets, like ML models and features, in addition to the datasets and dashboards we have today. That's an interesting area we will be working on pretty soon as well.
[01:01:13] Unknown:
Are there any other aspects of the work that you're doing at Metaphor, or the overall space of metadata platforms and their applications to the data ecosystem, that we didn't discuss yet that you'd like to cover before we close out the show?
[01:01:25] Unknown:
No, I think we covered most of it this time. The only other aspect is: you have so much rich metadata, so what insights are you going to derive from it? Beyond automation, how much predictive capability can you build on top of it? For example, "it looks like this change could affect you," rather than only enabling debugging or root cause analysis; being intelligent about notifications, or being intelligent about identifying a problem and proactively helping the customer. That's an interesting space as well, as we start accumulating a lot of this important metadata.
[01:02:01] Unknown:
And I think the privacy angle of this is also, we recognize, almost unavoidable. It is a big driving force for a lot of companies. Obviously GDPR, CCPA, and a whole bunch of other localized versions of them are coming out everywhere, so we definitely hear a lot of those use cases as well. I don't think we will be getting into that in the short term per se, but the catalog does play a big part in it, at least as a portal into how you control that sort of thing. Not necessarily enforcing the compliance requirements themselves, but definitely having a strong role in surfacing that information up to the user. So that is something I think we will have to get into sooner or later; it's just that right now our priority is mostly focused on the discovery and governance pieces.
[01:02:46] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're each doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:03:01] Unknown:
There's a very interesting blog post by Benn Stancil on the data experience. I think there are a couple of variations of it: data experiences, the data OS, and so on. I think he summed it up pretty well. We have great tools, no doubt; there are tons of great tools that are very specialized and very good at what they do. But what's really lacking is a cohesive experience. That is why data, despite all the talk of democratization, is still mostly used by technical people. Business people are still largely shying away from it, or just dabbling at the surface, and that is not true democratization. True democratization is when data serves its main purpose, which is improving the business, and business people are happy to use data on a regular basis. That is the piece we think we are also trying to solve, and we have a good shot at solving it because of the cross-cutting nature of the area we're in. So, yeah, we're definitely hoping we can bring out a product where you not only have the best-of-breed of all this infrastructure technology, but also a best-of-breed experience that enables everyone to truly democratize the platform.
[01:04:16] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share the work that you're doing at Metaphor. It's definitely exciting to see another entrant in the space, and one that is aiming for a more broad-based application of metadata throughout the overall data platform. I appreciate the time and energy that you're each putting into that, and I'm excited to see where you take the business. So thank you again for your time, and I hope you enjoy the rest of your day.

Thanks for having us. Thanks, Tobias.
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introduction: Pardhu Gunnam and Mars Lan
The Evolution of Data Metrics and Metadata at LinkedIn
Building Metaphor Data: Origins and Vision
System of Record vs. Data Catalog
Open Metadata and Metadata Modeling
Target Users and Personas for Metaphor
Technical Architecture and Scalability
Engineering Challenges and Pandemic Impact
Modern Data Stack and Integration Challenges
Handling Unstructured Data and Metadata Modeling
Integrating Communication Systems and Embedded Workflows
Onboarding and Integration with Metaphor
Common Misconceptions and Use Cases
Surprising Use Cases and Customer Insights
Knowledge Capture and Deprecation Workflows
Lessons Learned and Market Surprises
Future Plans and Exciting Projects
Privacy and Compliance Considerations
Final Thoughts and Contact Information