Summary
The interfaces and design cues that a tool offers can have a massive impact on who is able to use it and the tasks that they are able to perform. With an eye to making data workflows more accessible to everyone in an organization, Raj Bains and his team at Prophecy designed a powerful and extensible low-code platform that lets technical and non-technical users scale data flows without forcing everyone into the same layers of abstraction. In this episode he explores the tension between code-first and no-code utilities and how he is working to balance their strengths without falling prey to their shortcomings.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Your host is Tobias Macey and today I’m interviewing Raj Bains about how improving the user experience for data tools can make your work as a data engineer better and easier
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the broad categories of data tool designs that are available currently and how does that impact what is possible with them?
- What are the points of friction that are introduced by the tools?
- Can you share some of the types of workarounds or wasted effort that are made necessary by those design elements?
- What are the core design principles that you have built into Prophecy to address these shortcomings?
- How do those user experience changes improve the quality and speed of work for data engineers?
- How has the Prophecy platform changed since we last spoke almost a year ago?
- What are the tradeoffs of low code systems for productivity vs. flexibility and creativity?
- What are the most interesting, innovative, or unexpected approaches to developer experience that you have seen for data tools?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on user experience optimization for data tooling at Prophecy?
- When is it more important to optimize for computational efficiency over developer productivity?
- What do you have planned for the future of Prophecy?
Contact Info
- @_raj_bains on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Prophecy
- CUDA
- Clustrix
- Hortonworks
- Apache Hive
- Compilerworks
- Airflow
- Databricks
- Fivetran
- Airbyte
- Streamsets
- Change Data Capture
- Apache Pig
- Spark
- Scala
- Ab Initio
- Type 2 Slowly Changing Dimensions
- AWS Deequ
- Matillion
- Prophecy SaaS
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm welcoming back Raj Bains to talk about how improving the user experience for data tools can make your work as a data engineer better and easier. So, Raj, can you start by introducing yourself for anybody who hasn't listened to your previous episode?
[00:02:11] Unknown:
Thanks, Tobias. I'm delighted to be here. I'm Raj Bains, the founder of Prophecy. It's a low-code data engineering platform. Before this, I come from a background in engineering, primarily compilers and various parts of the data stack, and now bringing that kind of power to the visual tools for managing data. And do you remember how you first got started working in data? So I was working first in compilers, which was in Microsoft Visual Studio, some in Microsoft Research. And then I moved to CUDA at NVIDIA, where I was kind of in the early team, or you could say the founding team as we call it these days, building that. But after that, I wanted to build my own company. Compilers is not a space where you could start startups. So I was like, okay. What's a related space?
A lot of challenges were happening in data. Data was exploding. So I'm like, okay. Let me move into data. So I moved actually first into an operational database company that was building distributed databases called Clustrix. I went there as an engineer, worked on building distributed aggregates, worked on the query optimizer. And then from there, the problem was repeatable product-market fit. You know, they had a great product, but not quite a scalable go-to-market motion. So from there, I moved into product marketing, product management, trying to find the fit and that kind of movement. But that was still in the operational data space.
Then from there, I moved on to Hortonworks where I managed Hive, and Apache Hive was the primary product in the Hadoop stack. And through that, I got to understand the other part of data, which is, you know, where, I guess, most of the action is, which is analytical databases. And really got to understand the data engineering, the data analytics, you know, even data going into machine learning and the challenges that various companies had. So, definitely, it was, you know, pretty much asking, what is my current skill set, and with that, if I have to go into the startup space, it's very opportunistic. And then kind of using compilers as a thread to move from, well, you know, C-type compilers to database compilers to understanding the data space and now into power tools for data.
[00:04:24] Unknown:
In the compiler aspect, it's definitely very interesting and, I guess, underappreciated area of computer science where if you look at it the right way, most things can be broken down into the different stages of compiling, whether it's, you know, compiling code or transpiling. I actually had an interesting interview a little while ago with the CTO at Compiler Works to talk about their approach to using compilers for everything. And I know that you have a similar approach of being able to take some of that compiler logic to say, here's the high level goal. I'm actually now going to use that to generate the actual code that's going to run so that you don't have to be the person who's, you know, writing all of the verbose syntax to make sure that all of your Spark jobs are going to be fault tolerant and scalable.
[00:05:09] Unknown:
Yes. Definitely. Compilers remain, you know, at the heart of our product. And it's amazing how powerful it is and how many things we can do. I mean, because we are strong in compilers, we are able to take some of the legacy ETL products, move their code into our product. Right? And then, in a sense, when we developed our product, right, you have low code, which is visual. You have code. But then we have our own intermediate representation of a data pipeline. And now, you know, you come from one format, and you can go to PySpark. You can go to Scala. You can go to visual. And once you are in this compiler world, all the data logic kind of, you know, can quickly move between these various things, and you'll have this much deeper understanding of what your user is doing. You know, to a certain degree, it was just luck, but ended up being a great fit for this area.
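To make the compiler framing concrete, here is a minimal sketch of the kind of pipeline intermediate representation being described, rendered to PySpark. The IR shape, operation names, and table names are invented for illustration; this is not Prophecy's actual format.

```python
# Hypothetical sketch: a tiny pipeline IR and a PySpark code generator.
# The IR structure and operation names here are illustrative assumptions,
# not Prophecy's actual intermediate representation.
pipeline_ir = [
    {"op": "source", "table": "raw.orders"},
    {"op": "filter", "condition": "status = 'shipped'"},
    {"op": "select", "columns": ["order_id", "customer_id", "amount"]},
    {"op": "sink", "table": "analytics.shipped_orders"},
]

def render_pyspark(ir) -> str:
    """Walk the IR and emit readable PySpark code as a string."""
    lines = [
        "from pyspark.sql import SparkSession",
        "",
        "spark = SparkSession.builder.getOrCreate()",
    ]
    for step in ir:
        if step["op"] == "source":
            lines.append(f'df = spark.table("{step["table"]}")')
        elif step["op"] == "filter":
            lines.append(f'df = df.filter("{step["condition"]}")')
        elif step["op"] == "select":
            cols = ", ".join(f'"{c}"' for c in step["columns"])
            lines.append(f"df = df.select({cols})")
        elif step["op"] == "sink":
            lines.append(f'df.write.mode("overwrite").saveAsTable("{step["table"]}")')
    return "\n".join(lines)

print(render_pyspark(pipeline_ir))
```

Because the pipeline lives in a neutral structure, the same walk could just as easily emit Scala or drive a visual rendering, which is the property being described here.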
[00:06:00] Unknown:
And so in terms of the aspect of the interface that is presented for the different data tooling, you mentioned that prophecy is a low code tool to make it easier for people to be able to quickly wire different pieces together. Some people might prefer the, you know, very code heavy approach of, I want to actually write all of the logic and fine tune it and be, you know, deep in the guts of the systems. And then there are some of the more kind of vertically integrated where you just say, you know, we're going to provide you with best practices. You just tell us which, you know, prepackaged option you want. And I'm wondering if you can just talk through what you see as being the broad categories of data tools and the designs that they adopt and some of the ways that that impacts what is possible to build with them. Definitely.
[00:06:53] Unknown:
I will maybe start one layer lower in the stack and then build up to it. So if I'm doing data engineering, the first thing is I need to run it at a particular time. So you have the scheduler, orchestrator layer. Right? And why I'm mentioning that is sometimes, you know, when we went to raise our Series A, you know, some of our investors were not even able to differentiate between orchestrators and ETL tools. So I'm kind of going to, you know, break it up. So you have the schedulers and orchestrators, which is, like, using something like Airflow and then using maybe Databricks jobs if you're using Databricks.
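For reference, the scheduler and orchestrator layer he's describing can be as small as a DAG like the following. This is a generic Airflow sketch with made-up task names, commands, and schedule, just to show the "run the job at this time, run this, then that" shape.

```python
# Minimal Airflow sketch of the orchestrator layer: run at a set time, run this, then that.
# The DAG id, schedule, and commands are made-up placeholders for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_orders_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 2 * * *",   # run the job at 2am every day
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="echo 'load raw orders'")
    transform = BashOperator(task_id="transform", bash_command="echo 'run Spark transform'")

    ingest >> transform  # run this, then that
```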
And all the legacy ETL tools had, you know, an inbuilt scheduler, and then you had the enterprise-wide schedulers. Now they are saying, okay. Run the job at this time. Run this, then that. Right? And then you have the data engines. Right? You have Spark. You have data warehouses and then, you know, the data lake query engines. So that is your processing. Again, comparing to the legacy ETL tools, some of them had the whole thing in them. Right? And now it's kind of broken up. And now comes more of the tooling. Now the question is, but what about metadata? That, again, belonged in that same package, but today, there are some startups on the operational side around, like, LinkedIn DataHub or those. And then there are some on the consumption side, right, where you have Collibra and those people who are more in the data warehouse analyst interface.
Then there is observability again, which used to be part of the same ETL package. And now you have Acceldata and other companies who are focusing on, you know, end-to-end observability of your data. So these are all there. But all of these are still infrastructure, but not what you use to get your job done. Now, you have to actually get your job done. So you're going to say, okay, I'm going to use some paradigm. Maybe I'm going to write scripts. Maybe I'm going to use a visual tool. So that kind of starts to break down things into categories. So first is the category of point-to-point data motion. Right? It's like I have a bunch of systems. I just want to ingest data.
So there, you are going to get your Fivetran, Airbyte, Hevo. So these are kind of, you know, if I want to move data from Salesforce to my data warehouse, I just connect the pipe. It's going to land the data in a preset format and then going to update it. Right? So change data capture. So this is just the ingestion piece. And they are kind of saying, I'm gonna solve the connector problem, get data into your systems, make life easier. And then, also, there is a counterpart for the on-premise operational systems and getting that data into S3 in the cloud. So there's, like, Precisely, StreamSets. They do that. So these are the CDC tools. Now once you've done CDC, you have your source data from operational systems.
Now you've got to do the transforms. So transforms then have, you know, two ways of doing it. You can write your own code. So you are writing PySpark or you're writing SQL, though SQL doesn't always capture all the use cases. Or you are using a low-code approach where you would use Matillion or Prophecy. Right? And we are kind of different tools in that, different ways of doing it, but at least Matillion and Prophecy broadly, you could say, are in the low-code category, though we are very different. And then once you've done your data transformation, your data is inside the data warehouse, you want to write some SQL queries to get a regular report or something. You can use, you know, raw SQL. You can use something like DBT, which helps you break down SQL.
And any of these can do some light ETL, and they do. And then they do mostly, like, reports and then dimensional modeling and that kind of stuff. So these are kind of the set of tools that you can use on top of your data platforms, which is like Databricks, the data warehouses, you know, to actually create your business pipelines. So I guess this completes the broad categories of the tools. And then the second part of your question was how does that impact what's possible with them? The point-to-point data motion, right, that's all you can do. Right? So they are great. It's a specific use case, specific solution, and it solves it well. So they are great. Like, the Fivetrans of the world, great products. Now the next thing is when you are building data pipelines, you know, you are kind of deciding between the low-code options and the coding option. And there, you know, low-code options give you more reuse, more productivity.
And then if you go with the visual ETL tool, a lot less is possible than you could do in code. If you're going with Prophecy's low code, we are kind of merging the two approaches, the old visual ETL or the Matillion kind of ETL with code, getting the best of both worlds. And we can get later into it. But, like, we believe we can do everything and better. But in this stage, right, efficiency is important. Scale is important. You know? And, you know, reusability of business logic is important. Lineage, search, basically being able to understand what your transforms are doing. Because on the other side, there's a data analyst consuming the data. They've got to understand what's going on. Right? Governance is important. Right? Can you track PII data?
So in this middle bucket of transformations, if you kind of roll your own scripts, which is what most of the industry is doing, you don't have lineage often. You don't have search. You kind of give up on governance. Right? Your consumers, the people doing analytics, they don't understand how any value was computed. Right? Your metadata architect cannot understand where PII values are ending up. The code is severely limiting in terms of what it offers. Right? Apart from just the productivity of it, it's not really providing its customers, the data analysts, with basic information or the data engineers with these deep insights.
So that's the kind of comparisons here. And then if you go back to dimensional modeling, right, yes. So DBT would do that well. On the other hand, some people use DBT for ETL also, but then you can use it at a smaller scale because there you will not be able to do business logic reuse or stuff like that. Right? For us, our customers are like, I want reusable subgraphs. I'm ingesting data from 10 vendors. I want to run a set of rules on them. It's like, well, if you're writing a SQL script, there is no reuse. Right? But that is for if you're doing dimensional modeling, then that's the right solution. Right? If you don't need reuse, you don't need scale. So these are kind of the, I guess, the broad options in tooling and what's possible with each of them and how they compare
[00:13:39] Unknown:
at at a high level. To the point of the kind of productivity gains along those different axes in terms of writing the code directly, you get a lot more flexibility to make you productive at the kind of level of expressivity, but it reduces the organizational productivity. Because as you said, you have the, you know, very detailed code, but it's not in a format that's accessible to the downstream consumers of it unless you add in all of this additional engineering effort to wire all these pieces together. And so I'm wondering if you could just talk to that kind of organizational aspect of productivity by streamlining and kind of polishing the tool chains that the different stakeholders and data producers and consumers are working with to make it so that they don't have to add in all this extra effort just to be able to enable the collaboration aspects of these, you know, data workflows.
[00:14:37] Unknown:
This is a great point, and I'd like to cover it in two pieces. Right? One is for the data engineers themselves, and that also has an organizational component. And then the second is for the larger organization. So first, for data engineering itself. Right? What are the challenges of the organization? The central data platform team might have a few, you know, people who are experts in Spark, and then they're oversubscribed. So now anybody trying to build a team of more data engineers or hiring some younger people out of school — you know how hard it is to hire data engineers. And, you know, as they are trying to build that organization, getting the set of people who could be productive with this code is hard. So just setting up that organization to be able to deliver on its mandate is hard.
Number two is when they are actually working on that. Right? There is a huge amount of churn in data engineers. Right? They've worked for 2 years here, 2 years there, generally, in the industry. You know, when they come in, can they quickly understand your code? So the organizational challenge there is, like, you get these new people. They can't understand the code. Or you have two people who can write Spark code really well, and then you get another three on the team who are learning, and overall, they are not productive. Right? And then there are basic things within data engineering. It's like, okay. I'm going to change.
Let's say I'm writing a data pipeline as a data engineer, and I'm going to change a column. Right? What is the downstream impact of that? Am I gonna break two pipelines? But how I find out is I make the change and then somebody screams at me. Right? Or, you know, if your pipeline breaks in production, like, some column value just comes out to be wrong, now you've got to track where it came from. It might be a chain of 10 data pipelines that it came through. Right? And now who is going to chase it back across, you know, multiple pipelines written in different writing styles by different teams? Right? And figure out where the problem is to quickly fix it. So just saying that, fundamentally, just within the data engineering team, code is a suboptimal approach. Right? It causes a whole lot of problems, and we don't think it's a good solution in general. But then the second thing that you said, right, what's happening a lot that we are seeing in the industry is with the data analysts. Right? So if you look at it, what is the job of data engineering?
It is to provide data analysts or business intelligence and machine learning engineers with data that they understand, that is clean and in a good format, and they understand how it was computed. So, for example, there is a column called internal rate of return. Right? Some data engineer was feeling a little sleepy. They Googled for the formula for internal rate of return. Five results came up. They picked the third one, coded it up. That might be different from what the understanding of the business is. Right? So organizationally, what's happening is the line of business team or the team that has to actually deliver on analytics often cannot understand how a value was computed even if the data engineers provide it. Number two is often they are waiting on the data platform team to provide them with something, which is kind of really holding them back. And to that, right, there is a big reaction which you could call the modern data stack, which could be called the data platform team bypass stack.
Right? It's like data analysts are like, I'm fed up with the data platform team. They keep wrestling with the infrastructure and not producing what we need. Let's just bypass them, build a second set of tools. And so, organizationally, the challenge is, you know, the analytics and machine learning teams are not getting the data. When they get the data, they don't understand what it is. And I was talking to one of the companies, somebody from the data platform team of one of the top retailers — a shoe retailer, you know, a top one. And they're like, you know, we enable all our data engineers to write whatever code they want. And then I'm like, okay. Are your data analysts enabled to understand, you know, the data they consume? And they're like, no. They are not enabled.
But they are your customers, and they're like, yeah. We know. And then they asked us. Right? They're like, what do you think happens with us? And we're like, we don't know. They're like, what would you do? I said, I would ask for the source data. And then they're like, yep. The data scientists ask for the source data back. And once they ask for the source data back, they build the same pipeline over again because they can't trust what the data engineering team is giving them. So the BI team is doing that. And then they said, you know, if you could just build lineage and find patterns, I'm sure you could remove 60% of our ETL pipelines because they're all computing the same thing over and over.
Because, you know, what one team produces, the other team cannot trust. So organizationally, that is like the consumers are not enabled. And finally comes the governance piece. Right? Personally identifying information. You want it to come from truth. Right? So you've written some code. It's committed to Git. Your pipelines are running. I want to know where my personally identifying information is ending up. And the data governance people have a huge challenge. Right? Adding business definitions, knowing what formulas are there for every value, and tracking values flowing, you know, for their governance needs. So they're usually, for months, sitting with one team after another, trying to understand.
Right? And, you know, we work with some of the largest financial companies and banks. You know, they are just sitting with team after team after team for months trying to figure out how PII data is flowing and if they are, you know, meeting compliance. So I I would say that the move to code by the core data engineering team, like, has messed up absolutely every piece of the tool chain, every data constituency. Right? The governance people, the data analysts, the machine learning people, and the core data engineers and such. Right? Like, everybody's life is horrible. So that's kind of, you know, side effect of having code.
[00:20:51] Unknown:
Touching briefly on 1 of the points you made of 1 of the organizations you spoke to saying, oh, we let our data engineers write whatever code they want, and sometimes that might even mean in whatever language they want. And I'm wondering what you see as the kind of levels of utility and kind of ease of use that get introduced by saying, okay. We'll let them write code, but we're going to standardize the tool chains to say, you know, you're not going to write Scala for Spark over here and Java for Spark over there and PySpark over there and then, you know, use Dask over there and write it all in Python, and then we're just going to hope that it all comes together. You're actually going to say, you're going to use PySpark everywhere because we have Spark in our infrastructure. We've invested in that. We're going to use a single language, a single interface because that way the data scientists and data analysts can understand the code more easily versus saying, okay. We're actually going to abstract out the code entirely and go to a low code solution to say, we will let you build these prepackaged components, but you're not actually going to write all of the code directly. You're going to write this component, and then you're going to wire it together because there are defined contracts between the outputs of that and the inputs of this. And then then you're just going to, you know, stitch them together in this GUI builder.
[00:22:06] Unknown:
So I think one is that, as you see, right, the customers who were early adopters of Hadoop, for example — when we go there, they are in the worst shape. Right? They have Pig scripts, and they have Hive scripts, and now they have Spark scripts. Some of them are, you know, the Java application people who came in and wrote Spark in Java. Less of that, there's some Scala and a whole bunch of people who'll only write Python. And the data scientists were, you know, also sometimes contributing, and that's all Python. Right? So now that you have all of these, it is a complete mess. Right? No way to run an organization.
It's kind of the worst way to build a data engineering team because you are going to get no search, no lineage, and even across teams, you know, one engineer might not understand another engineer's code. And it's also challenging to hire new people and say, okay. Now you're going to write Pig scripts. Right? Like, I don't want to do that. Right? That is the absolute worst in the sense that the data engineering team just gets busy wrestling with their own code and cannot, you know, take their head up and actually think too much about what they have to deliver. Now let's say somebody comes into the team and says, here is the diktat. Everybody should use PySpark.
That is great. Everybody can use PySpark. Now they've used it. So you have this uniform code base, but the question is, if I'm a data analyst who's using SQL, well, what about what I at least used to have? Right? So let's say, in the last generation, I was using Ab Initio. And, by the way, we have great respect for that product. They did a lot of the things really well, and it's not very well known. But if I came back as a data analyst, I could get a single formula, you know, which would show going across pipelines and pulling the business logic from each one of them and saying, this is how it was computed. Right? And in a language that I can understand, not code. Right? It might have some SQL function type things. It's just the basic set of transforms of how it was computed based on the original values. Because they are trying to understand what does this value represent.
So in terms of that, right, PySpark is still not giving the data analyst that. Right? It's like, yes, now every data analyst who wants to understand the code can go read the PySpark, but then do they actually write Python? Right? And then let's say I want to search my datasets, you know, and understand how they came about, like lineage. Right? Do you have that with PySpark? You don't, really. Most people don't. I mean, well, let me say, even if you look at the products built inside LinkedIn, you know, DataHub, or the metadata products built inside Uber, inside Lyft — Amundsen and these.
Not one has column-level lineage. Right? That means that when I pick up a column, I don't understand how it was computed. It does solve the problem within the data engineering team that, you know, new people can come up to speed quickly because the code base is a little more uniform. You can move people from one place to another. If you have good coding practices, they can also understand each other's code much quicker. So the data engineering org, like leveling up, is working okay. But then the consumers, right, the metadata, the governance people, the analysts downstream, they are still not served. And then there are also some productivity issues. So they are not going to get data as fast as they want. And finally comes the low-code approach. At least with the low-code approach that we have, right, I think what we do uniquely is that you can use visual drag and drop, and we are generating really high quality, readable, performant code on Git side by side. Right? So that means you're still producing the same assets, but you're producing them out of standardized Lego blocks.
Right? So in this approach, because it's packaged in a uniform way, we can provide search, we can provide lineage. And it's very easy for somebody who's new to come in and, you know, build pipelines or understand somebody's pipelines. As well, there's nothing really that you could do in code that you can't do in our approach. Right? Because we let users add their own visual components. Right? So you can say, here is my Spark code. This is my high-performance connector that I want to write, you know, maybe to my own REST service or to something else. And I package it up. And here you go. And now everybody's using that same reusable asset. Right? And so in that sense, right, there is nothing that you could do in code that you cannot do, but now the same thing is available to many more users. Yeah. So ad hoc code, absolutely bad for the data engineers and the consumers.
I would say standardized code, data engineering teams, a lot of their problems solved, but their customers, the analysts and those downstream, still are not served. But also I would say the data engineering team can only hire people who know how to write Python and PySpark and Scala. So there is a little bit of a limit there. But at least their org is running functionally well. And then low code is like any data user can build pipelines, and the quality of the assets they are building is superior to hand coding, is what we find practically. And there's nothing that you cannot do that you could do with the code and, you know, so there's no loss of flexibility or power either.
So usually, when we look at it, you know, as we talk internally and with our first customers, right, it's like usually when you choose A over B, there are trade-offs. And we are at least coming to very high conviction that there are no trade-offs. It's like low code is just in every way better than coding, and our customers say the same once they adopt it. Digging more into
[00:27:47] Unknown:
prophecy, you know, you were on the show. I believe it was about a year ago. We dug pretty deep into the product as it stood at the time, talked through some of the design elements there. I'm wondering if you can give us an overview of some of the ways that the product and the platform has evolved since the last time we spoke and some of the learnings that you've had as more people have been onboarded onto the system, more people have been stretching it in different directions, and some of the ways that you have increased the kind of utility and kind of ease of use aspect to make sure that people are able to be productive
[00:28:23] Unknown:
while still having the necessary flexibility to do the kind of bespoke custom logic that they need in the places that they need it. So just, you know, as a reminder for those who have not seen the last show, so, basically, you could say there is 4 kind of what we are say are pillars. Right? 1 is there is low code development. That means a lot more users. You could say 10 x more users are enabled, and they are 10 x more productive because you can quickly drag and drop, attach these, hit the play button, see data after every step. And side by side, of course, you're creating code. Right? The second thing is what are you creating? You're creating code standardized, readable, well structured on Git using the best practices that you want to use for your mission critical applications or data pipelines. Right? So you are using git test CICD. So best of those practices.
And then we are giving the complete product where you are not wrestling with the infrastructure. Right? So we are saying, okay. You'll get development. You'll get metadata, lineage, scheduling, you know, everything that you would want, you know, end to end. And finally is extensibility. I think extensibility is what we've spent most of our last year of engineering. And it turned out to be a much, much, much harder problem than we had anticipated. So we quickly had a prototype out. We started talking a little bit about it last year, but that has been a very hard technology to build. It has taken way longer than we've thought, and it is something that is extremely important to all the customers.
We found so many use cases for it that we were not anticipating. Now once we talk about that, right, every customer conversation starts centering around extensibility, and I can give some examples. Right? So, going back to the extensibility piece: what we did in the first pass — you could think of it as v1 of our product — was you could go from visual to code. Right? You are doing visual drag and drop. It generates some code, and it's quite cleaned up code. Very readable. Very well structured. And then you could change the code, go back to visual. So we did the code generation. We did the parsing. And that was working fine. But then what we said is, you know, that we are going to add extensibility.
That means we threw away that entire code base and said even our standard components are going to be generated from a spec. And that spec says: here is a Spark function; that is my standard operation that I want to do. These few fields come from the user. Right? And you can't really do this in Python, for example. Right? Because 80% of the meat of that function is coming as an argument. So, you know, all you would have is arguments and very little business logic if you did that in Python. That's why people create these XML/JSON frameworks. So now you have a function to represent your business logic. You have a function of 20, 30 lines which kind of gives you a quick layout of the UI and how it's connected to those values that are being plugged in by the user.
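To give a rough feel for the idea of generating a component from a spec, here is a hypothetical sketch in Python. The spec shape, field names, and helper functions are invented for illustration and are not Prophecy's actual API.

```python
# Hypothetical sketch of a visual component ("gem") defined from a spec:
# one function carries the Spark business logic, and a small declarative structure
# describes which fields the user fills in and how the UI lays them out.
# None of these names come from Prophecy's real API; they are illustrative only.
from pyspark.sql import DataFrame, functions as F

FILTER_GEM_SPEC = {
    "name": "FilterRows",
    "user_fields": [
        # The expression box would get Spark-SQL autocomplete against the input schema.
        {"id": "condition", "kind": "spark_expression", "label": "Filter condition"},
    ],
    "layout": ["condition"],  # stands in for the 20-30 line UI layout function
}

def filter_rows(df: DataFrame, condition: str) -> DataFrame:
    """The standard operation; most of the 'meat' arrives as the user's expression."""
    return df.where(F.expr(condition))

def generate_code(spec, values) -> str:
    """Emit the readable code that would land in Git when the user saves."""
    return f'df = df.where(F.expr("{values["condition"]}"))'

# e.g. generate_code(FILTER_GEM_SPEC, {"condition": "amount > 100"})
```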
Now from that, we are generating these visual components. But then, in these generated visual components, the usability has to be very, very high. And that's one of the learnings. Right? There's a lot of attention to detail, but I'll go into some of the things. Now you're generating this UI, but then when you connect to it and you open the UI for this particular visual element, you'll see the schema coming in. Right? And now then, you know, the spec says, hey. Here is something that is going to get a Spark function or an expression. You start typing. There is an autocomplete that knows this is a Spark function. And you start typing an expression in the middle. You start typing something. It says, oh, it looks like you're trying to use this column that is available at this point in time.
So there's that kind of autocomplete, you know, where it understands the kind of piece of Spark you are using. It understands the schema at this point, and it's kind of helping you along. And then as soon as you close it, even before you run it, you know, we can tell what the output schema is going to be. Right? So there is static analysis that's quickly going to say, okay. These are your transforms, and this is your input, then this is the output schema. And all of that is generated. Then if you save it, you get code. The code is generated.
And then if you change the code, of course, the parsing — we are going to add that; we haven't gotten to it yet. But getting to this layer that is generated with this really slick attention to detail. Right? This autocomplete, the input schema, output schema evaluation. And then within that, we also implemented some other things, which kind of make the life of the user easier, because the save button was a pain. So we've moved on to autosave. Right? And then every time users change something, they want to just hit play and see how the data looks. So we have really, really fast what we call interims. I guess that's our internal term. Basically, after every transform, I want to see a sample of the data. So you hit play, and when we run the program, we instrument our Spark logical plan. You know, we cache where we need to. But the main thing is if you hit play, in 3 seconds you're going to get the answer back. Right? So you hit play. Usually, it's a second. Can be up to 3 seconds. And now you can see the new data after your transform. So very, very quick iteration loop. Right?
Then we had to do some things for quick productionization, like single-click scheduling. People are like, okay. I just wanna schedule this pipeline. And it's like, okay. Here is a button on the main development screen. And, you know, we've also focused a lot on UI design. Right? Actually, we found a really, really good designer who's, like, a guy from Brazil sitting out in Bali. And, man, sometimes I'm jealous of, you know, his environment. He's in such a nice place. So, you know, working with him to get absolutely clean visual elements, like, it's single focus. Right? We have 16-inch MacBook Pros and have giant monitors. Some of our customers are using Windows machines with 13-inch screens. So it's like, okay. There is a lot of clutter. Right? If you use, let's say, a Microsoft development product or something. Right? These days for something like data pipelines, you've got a thing in the middle where you have, like, the visual elements, and then you have a screen on the right that's showing you some properties, a screen on the bottom with properties, a screen on the left, and, like, business logic has no space.
So we've kind of gone in and said, no. No. No. It's, like, one item at a time. We clean the entire screen. Very simple visual canvas. When you click a particular one of what we call visual gems, you know, you get something that comes on screen, covers 90% of it, and it's like one item at a time. Very simple. Very clean. So a lot of these things have gone a long way, and maybe I'm going too long on this. But I think the attention to detail in design and really caring about your data users and making their life easy, you know, that's something we iterated a lot on. Stepping back to your question, it's like the two main things. Right? One is this extensibility that has been very challenging to build and is used a lot. And second is the design element of it. I think we focused on making that significantly better, and that has made people's lives better. And maybe, if I may, I think it might be good to give some examples of extensibility.
Right? We have one customer who's coming and saying, okay. I want to do ingestion, and I have to ingest these 50 tables, 100 tables from this data source. Okay. And now for many of them, I want to do an SCD2 merge. And then those users tried in Spark to write the code for SCD2 merge multiple times, and their code wasn't all that. They found it challenging. Right? I don't know if your listeners are familiar with SCD2 merge. It's like, on the operational side, let's say there's an ecommerce site, and I have an address. And then I change it. I change it. I change it. But on the analytical side, I want to keep a history. I had this address from this date to this date, this address from this date to this date. So when you take that operational data and actually put it in the data warehouse, you've got to look at the existing values of those. Then based on that, modify the existing values, modify the new values. You know, the code for that is complex.
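For listeners who want to see why that code gets complex, here is a compact sketch of an SCD Type 2 merge in plain PySpark. The table names, column names, and the single tracked attribute are assumptions for the example.

```python
# Sketch of an SCD Type 2 merge in plain PySpark (one tracked attribute: address).
# Table and column names are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

dim = spark.table("dw.customer_dim")                # history: customer_id, address,
                                                    # effective_from, effective_to, is_current
updates = spark.table("staging.customer_updates")   # latest snapshot: customer_id, address

# Keys whose address actually changed (or that are brand new).
changed = (
    updates.alias("u")
    .join(dim.where("is_current").alias("d"), "customer_id", "left")
    .where(F.col("d.address").isNull() | (F.col("u.address") != F.col("d.address")))
    .select("u.*")
)
changed_keys = changed.select("customer_id").distinct()

# Existing rows for unchanged keys carry over untouched.
unchanged = dim.join(changed_keys, "customer_id", "left_anti")

# For changed keys, close out the current version; older history stays closed.
superseded = (
    dim.join(changed_keys, "customer_id", "left_semi")
    .withColumn("effective_to",
                F.when(F.col("is_current"), F.current_date()).otherwise(F.col("effective_to")))
    .withColumn("is_current", F.lit(False))
)

# The new values become the current, open-ended version.
new_versions = (
    changed
    .withColumn("effective_from", F.current_date())
    .withColumn("effective_to", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True))
)

result = unchanged.unionByName(superseded).unionByName(new_versions)
# Write to a staging table rather than overwriting the table we just read.
result.write.mode("overwrite").saveAsTable("dw.customer_dim_staged")
```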
And they're like, we have to do that. Can you give us a standard transform for that? We are like, here is your SCD2 merge. Then they're like, you know what? I want to just loop over all the elements, over all my 50 tables, and do the ingestion. We're like, okay. Here is a table of configuration — and that's something we are building. It's not released in the product, but we have a prototype working for them. It's like, okay, here you have a loop subgraph where you can loop over all the tables from the input and ingest each one of them one by one, and apply some basic cleaning rules and the right kind of merge that they want for each table. But for those, we are creating these, you know, Lego blocks. But all they have to do is connect them. Right? Then there is data validation and quality rules. So now we have this one vendor who's getting data from 20 sources, and they've said, I want this set of rules to apply. You could call them data validation rules. You could call them quality rules. But then, you know, these rules have got to be reusable in any pipeline, I could say. So let's say I'm an insurance person. I'm getting this from 20 different vendors. I change the schema a little bit to normalize it, and then I want to apply the same set of rules. Can I have the set of rules as a reusable thing that I can apply to any data of this particular kind?
So we've had that, right, as a reusable component. Or de-identify. That's a very interesting use case. We've got some financial companies. They've got new data coming in, and they are like, I want to de-identify it. Maybe they'll move the dates by a little bit. They'll do a certain series of transforms that they do to remove identification. You know? And then in all of these cases, right, the reason I'm giving so many cases is that the cases we are finding are just numerous. And in each of these cases, they want to say this is my standard way to de-identify datasets. This is my standard way to write data quality rules. This is my standard way of doing ingestion. And they want to create these visual gems, visual components that are, you know, unique to them, and then they just reuse them.
You know, they create it once. Everybody can reuse them. De-identify? Go to the menu, put the de-identify block there, and you're done. Right? And if you were using SQL, you could never do that. Right? I think building extensibility has been challenging, and we are finding tremendous use for it. And then, of course, I talked about design.
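To ground the idea of reusable de-identification and validation blocks, here is a small sketch of what such shared transforms can look like as plain PySpark functions. The column choices, the hashing, and the date shift are illustrative assumptions, not a prescribed standard.

```python
# Sketch of reusable, shareable transforms: a de-identification step and a
# validation rule that any pipeline can drop in. Column names, the date shift,
# and the hashing choice are illustrative assumptions only.
from pyspark.sql import DataFrame, functions as F

def deidentify(df: DataFrame, id_cols: list, date_cols: list, shift_days: int = 7) -> DataFrame:
    """Hash direct identifiers and shift dates by a fixed offset."""
    for c in id_cols:
        df = df.withColumn(c, F.sha2(F.col(c).cast("string"), 256))
    for c in date_cols:
        df = df.withColumn(c, F.date_add(F.col(c), shift_days))
    return df

def require_non_null(df: DataFrame, required_cols: list) -> DataFrame:
    """Flag rows that fail the shared validation rule instead of dropping them silently."""
    passed = F.lit(True)
    for c in required_cols:
        passed = passed & F.col(c).isNotNull()
    return df.withColumn("passed_validation", passed)

# Any vendor feed, once normalized to the shared schema, can reuse the same blocks:
# clean = require_non_null(deidentify(raw, ["ssn"], ["dob"]), ["policy_id", "claim_amount"])
```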
[00:39:10] Unknown:
So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in select star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use. With SelectStar's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets.
Try it out for free and double the length of your free trial at dataengineeringpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. One of the things that comes up a lot whenever the subject of low-code or no-code platforms is discussed, particularly when you're talking to a group of engineers, is, you'll pry my code out of my cold, dead hands. You know? I'm curious how you approach some of that resistance to, you know, what some engineers will see as kind of dumbing down their capabilities or, you know, saying, oh, well, I can do all of that in code. You know, why would I use this UI builder? Because I'm not getting the expressivity that I'm used to. And just sort of some of the answer to that kind of knee-jerk reaction of, but that's not the right way to do things.
[00:40:31] Unknown:
Here, honestly, we are finding some challenges. Right? And there are different kinds of scenarios we encounter, and they're very interesting. So one team says, hey, we are a group of five Scala developers — Spark Scala developers. Our high salaries depend on the fact that we know how to write Spark Scala. We look at your product, and we know we can do that much more easily and produce the same results, but it will reduce my salary. So no. Thank you. Right? So there are those teams. And, you know, there we just give up. Right? And say, okay. You're not the right team to talk to. There's no convincing them. Right? It's just not in their interest. Now the other thing we are finding is there is another mix where you have a few programming data engineers, and they want to enable a whole new set of people. Right? So what they're saying is, I am the central data platform team, or I have these few programming data engineers, and they are kind of not able to meet all the requirements being thrown at them. And they're like, I want to enable the organization. For them, the sell is — so, basically, instead of them building data pipelines for all the users, we talk to them and say, you can build these standard Lego blocks. You write your Spark code and say, this is how you do A. This is how you do B. This is how you do C. And now you just go into defining standards that people use to build their pipelines instead of building the pipelines for them. It's kind of an upskilling for them. Right? The leverage they have and the value they're adding to the organization goes up quite a lot. And I think that gets positive reception. They're like, yeah, it's all about enabling these people. And then what happens is all the simpler pipelines get taken away by other users.
And then this team can focus on, you know, cost and performance and, you know, handling the difficult cases. Right? And sometimes they will go in and write scripts. Right? And write some special code for things that need to be done. Sometimes it's reusable Lego blocks. Sometimes they might have to do something unique. But now all of their time is going into challenging areas, into defining standards, into solving the hard problems. So I think those teams are receptive. I think sometimes the managers of those teams — their counterparts in the business teams are not happy with them, and they're trying to find a way. So those teams are happy. And let me add one more category to that. Sometimes, you know, a first team has brought in Spark and Databricks, often, and they are productive. And now they're like, oh, we've set up this great platform. Now we want everybody to use it. And then everybody else is like, I don't know how to write Spark code. And so now they're stuck with this great system they've put together, and they can't get users. Right? So we get some people, like, we have a company that is one of the housing Internet companies. Right? And they are like that. They're like, you know, we want to enable everybody on this now. So that category is quite receptive in the data platform team, you know, who are thinking like that. And then there is what you could think of as, like, kind of the line-of-business group. They just want to deliver data, and they don't really care. Right? It's like their organization said they're going to use Spark.
Some users are using it right now. They need to get their work done. Some of their source data is in Spark and in their data lake or in Databricks. And they just have to get a bunch of work done. They are primarily focused on delivering analytics. So they are the ones who will pull. Right? They will say, hey, we want low code. So we kind of have these constituencies. The people who are like, you know, we are going to code everything — there is no debating. But I don't know what the future of that is. There is no hope for them. Right? We give up. But the other thing is, the modern data stack, I feel, is kind of bypassing them. Right? It's like, right now, they can say, okay. You know, I am the only one who knows these data pipelines — some of them in Hive, some in Spark, some in Pig. And nobody else does, so I have job security, and I am powerful.
Right? I mean, that will work for a while, you know, then people will run out of patience and they'll get in trouble. So I think there's going to be, you know, maybe if we meet, you know, and talk again in 2 years, 3 years, there might be a consensus in the industry that scripting was a bad idea.
[00:44:54] Unknown:
There's this question of the sort of ease of utility that's offered by low code tools and by the various commercial offerings that are out there. And you made an interesting point about the modern data stack being this kind of disaggregation of the different individual components, but with a fairly straightforward integration play to be able to say everything is oriented around the data warehouse or the data lake. Everything plugs in together nicely. You just need to kind of pick and choose what you want. So there's not as much kind of heavyweight integration, custom code that's necessary to wire everything together.
I'm just wondering what you see as the next iteration and the next generation of data engineering where this level of kind of ease of use and accessibility beyond just the very hardcore engineering teams is table stakes to be able to actually say, this is a viable tool that I want to release, whether it be in the, you know, open source ecosystem or proprietary and just the level of interoperation and collaboration that is at play to make sure that, you know, just because somebody says, okay. I'm going to use prophecy to make it easier to build these composable units together to empower my entire organization, you're not left out in the cold if you say, oh, actually, I now wanna go and start integrating with this other piece that's brand new and experimental, but has all these, you know, interesting and useful capabilities.
[00:46:17] Unknown:
This is a very, very good point. I think as the new modern data stack evolves. Right? And that's 1 of the powers of code. Right? And if you say I have a visual tool only, then you're kind of stuck in your silo. It's hard to integrate with the ecosystem. Right now, what we are saying is you have this visual layer productivity, but then you have all the power of code. That means as new technologies come, you can integrate with them. Right? Even if you use a different metadata system. Right? You can write a visual component that just connects to that metadata systems and registers, say, a feature. You can say this is my feature store and just register this column as a feature or things like that. Right? So in this sense, right, the ease of use must be there because the current way people are going, where it takes just so much effort to get basics done. That is just not sustainable.
So that part we feel is at least going to slowly just die off because you're not really getting anything for that effort. Now because a lot of data platform teams are integrating all kinds of different tools, And I've come from the Hadoop Hive. I've kind of, you know, had a stint there. There were 50 tools, and then, you know, they all went away. And customers are like, Databricks and Snowflake. I want just 1 thing. Do the job for me. I don't want to spend my life integrating stuff. So there is that. Right? That there is a move in the engine space towards a single complete solution. And then the question is that will this come to the data tooling layer as well?
And in the data tooling layer, right, today, you know, it's the same thing as with the data engines. The data tools are getting better. But there is a lot of integration effort. Right? I'm going to use all these different data tools, and I'm going to stitch them together. It's not as bad as it used to be, because if you were using Hadoop, you needed to know the difference between map join and shuffle join. Right? And it's like, why do I need to know that? It's like, because that's a horrible optimizer underneath. Right? Those engineers haven't built a good system. Now if you were going to interview somebody to be a Snowflake user, like, would you ask them the difference between map join and shuffle join? No. It's like, that's Snowflake's job. They've got a query optimizer. Right?
So the table stakes have changed in the data space. Right? Where now, if you build another data platform where somebody has to come in and say what the difference between map join and shuffle join is because you don't have a good query optimizer, you're dead in the water. Right? That is table stakes. Now going back, right, if you're asking, doing the simple things is hard: I have to write the script. I have to create a JAR. I have to do, let's say, spark-submit and then write to a table, go read the table and figure out if it was good. Right? So now it's like, for the productivity of the overall organization, just the simple things have to be simple. After the low-code products like Prophecy and others become widespread, you know, it's no longer going to be sufficient to say, you know, okay, I only provide this one thing.
Right? Because, okay, you provide me with some simpler development. Do I get lineage? Do I get search? Do my data analysts know this? Do I have governance? Right? Once you have a product that can do 10 things, just like Databricks does 10 things. Right? And then, if you have all of those — like, can I go and open a Spark company today, or some other processing engine? And they're going to be like, do you have the demo? It's like, I need that. Right? So the table stakes — as one or two complete products take shape, then complete tooling products take shape. I think usability and completeness are going to become two axes on which, you know, the table stakes are going to be that 90% of your job is very simple.
And then the question is the other remaining 10%, the hard use cases. I want to use a new experimental framework; I want to be able to go down and optimize my stuff. That power has to be there for the last 10%, and having one or the other is not going to be sufficient. Today, coding gives you the power but not the productivity, and the old ETL tools gave you the productivity but not the power. We are saying we're going to give both. And if we do provide both in the market, we look at it and think that is going to become table stakes. So it will be interesting to see how the data stack evolves, but it will be challenging. For example, take Matillion. They have built a very easy to use ELT product. It's visual drag and drop, and it generates SQL.
That's great. But if I want to do something outside of SQL, how do I do it? Then you run into a problem. I mean, they solve their use case; the simpler use cases they'll be able to solve. But once visual drag and drop becomes table stakes and everybody has that, how do they compete? And I think that's where having the power of code underneath, at least for us, is something we think will really matter.
[00:51:27] Unknown:
Absolutely. And in terms of the applications of what you're building at Prophecy, what you've seen people using it for, and some of the ways that it has accelerated their productivity and time to value, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:51:43] Unknown:
Yeah. So I think what we are seeing is that some of these are business use cases. We've had some customers who were like, I don't have a data engineering team that can write a lot of Spark code. They had even outsourced to some Spark developer consulting firms, and they weren't getting enough turnaround. So one use case we are seeing is insourcing. We have a bunch of customers who say, okay, I'm going to take a new college grad who's done some SQL and a few other things, has two years of experience, doesn't know much about Spark. I'm going to put that person on Prophecy. Can they build a pipeline? How long do they take? And as those people build pipelines very quickly, we've seen them bring their entire data engineering pipeline in-house. For those teams, they are asking us to write transformations that we don't have yet. Maybe later we'll end up with a whole package that is publicly available.
But today, they are coming and saying, hey, I want this, I want this, I want this, all the common things they want to do, and they are doing that. The other thing, and I don't know if it's surprising, is something we've built that's also in private beta. We have a Prophecy 2.0 release coming up in early June, and we already have some customers on it, going into the Spark Summit, or the Data + AI Summit as it is called now. In that, we have a lot of users using custom blocks for doing all kinds of things. So that's one thing. The other thing is that subgraphs seem to have become really important.
People want these reusable subgraphs, and they've pushed us for it, so we built that. Then they said, I want job metrics by default and quality metrics by default, and we've built that. The other thing is that people are really liking the Git flow. They asked us for a visual version of it, and we were not sure how many people would take to that. They've come back and said, hey, you've got to simplify it, but now we've got visual developers who are using branches and Git and all of that. And they're asking us, hey, can two of us be developing in a branch, and can I see the cursor of the other person?
And can they come and help write a transform for me, the way I can do it in Figma? So where they're moving to is that even the simple visual developers are saying, this Git branch thing is great, and I want real-time collaboration. We don't have that yet, but we are seeing that demand, and that's interesting. The other thing we are seeing, and it might be more of an industry-wide thing, is a second paradigm: some of the users who are more on the data analyst side want to be focused around tables. It's a table-centric rather than a pipeline-centric view. Because if you go to data engineers, they have seven sources and three targets, and all of that is for efficiency.
But if you look at other products where you have a pipeline per dataset, which is very focused on, okay, this is how I produce this dataset based on other datasets, that's a simplified worldview that we are seeing and being asked for. We don't have that yet. But we do see that dbt has that, and Databricks' Delta Live Tables has that model, where one dataset is coupled with one set of transforms just to compute that dataset based on other datasets, and it's turtles all the way down. That's interesting because it's not focused so much around efficiency, but around understanding.
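For reference, the dataset-per-definition style he attributes to dbt and Delta Live Tables looks roughly like the following sketch in the Delta Live Tables Python API; the table and column names are invented, and this only runs inside a Databricks DLT pipeline.

```python
# Sketch of the "one dataset, one set of transforms" model in the
# Delta Live Tables Python API. Table and column names are invented.
import dlt
from pyspark.sql import functions as F


@dlt.table(comment="Orders with malformed rows removed")
def clean_orders():
    # This dataset is defined purely in terms of an upstream dataset.
    return dlt.read("raw_orders").where(F.col("order_id").isNotNull())


@dlt.table(comment="Daily revenue derived from clean_orders")
def daily_revenue():
    return (
        dlt.read("clean_orders")
        .groupBy(F.to_date("order_ts").alias("order_date"))
        .agg(F.sum("amount").alias("revenue"))
    )
```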
And if that model can be efficient enough, maybe it will become important. So we are seeing some new patterns, though most of the common cases with Prophecy are fairly standard. The other thing that surprised me is that some customers are using us to get analytics users onto the platform. We have a company that says, I want to get all the trading floor people to use this, because they are stuck behind the data platform team. Those are very different skill sets. So we are finding interesting uses from different profiles of users that people are trying out, seeing how they can do that. I think the organization of teams is interesting.
[00:56:12] Unknown:
The different patterns we are seeing, pipeline-centric versus model-centric, yeah, some interesting patterns. And, of course, like I said, the other pattern is, can you give me a visual component for this complex thing I want to do, and just having that. I think those are the things we've seen the most of recently. Yeah, the note that you just made about pipeline-centric versus model-centric modes of collaboration is definitely interesting, and maybe something that we can explore another time: how do people break down the boundaries and the areas of responsibility at the organizational level? Do they say, we're going to collaborate from data engineering through to data analysts through to business users along a single pipeline, going from this application source all the way through to this business intelligence report? Or is data engineering responsible for all of the source data collection and curation, data analysts do all of the modeling, and then the business users handle all of the utility of that? Definitely an interesting variation of paradigms based on what organizational capacity they have.
[00:57:15] Unknown:
Yeah, and just a quick note on that: it seems like there is some amount of movement in the industry toward these unified data teams, rather than being broken up by vertical silos. That's another interesting trade-off we are thinking about: how should we view ourselves? Right now we're saying, okay, data engineering, whoever wants to do it. But should we start thinking about growing in the direction of saying, here is a unified data team, since that seems to be the model people are slowly moving toward? Do we say, okay, we're going to cover the whole thing for them? There are some interesting design points. We have to see how the market evolves, but there's definitely some movement in that direction. Absolutely.
[00:57:55] Unknown:
In terms of your experience of building the Prophecy platform, growing the business, and now going through your Series A and starting to scale the organization more, I'm wondering what are some of the most interesting or unexpected or challenging lessons that you've learned in that journey?
[00:58:09] Unknown:
Yes. So I think we started in a direction that was hard. Looking back, we said let's go to some of the largest enterprises and look at some of the most complex use cases, and we started there. The bad part is that it took us quite a while to show customer traction. For quite a while we were on the seed round; we had seed one, seed two, seed three, from the same investor. You could just say a seed broken down into three. It took us a lot of time, but things like extensibility, a huge amount of business logic reuse, being able to use business rules and reusable pipelines, a lot of that came into the product because we started with the more complex use cases.
And the focus on 100% open source code and CI/CD — we learned some of the best practices from those customers, and that learning was very valuable. Then we've had to scale the engineering org, and continuing to find top-level talent without having a brand name has been challenging. We've not had good luck in the Bay Area; the price-to-quality ratio is just not as good, and a lot of the people are locked into large organizations. That is maybe more outside the technical domain, but the US has got to fix the housing cost and immigration, especially in the Bay Area.
Those have got to get fixed, because otherwise there's no way for us to hire people here; the cost of living is so high and the available talent is so thin. So there is that challenge. We had to continue to grow the team, but we did take on some growth in communication cost, and we were very, very deliberate in saying we have to go fast, fast, fast to build the product. So we've gone with this model of very high quality engineers and much lower communication cost, and built around that. I think that has worked really well for us. The Series A fundraising was quite challenging.
First, investors would say, oh, in the cloud everybody writes code, because that's what they're doing today, and therefore that's what everybody will do until the end of time. And when we said they're going to do drag and drop there, they'd say nobody's doing it — but the product that lets them do it doesn't exist yet. So that was quite a challenge, and we had a whole bunch of VCs turn us down for that. And I was actually surprised, because a lot of the big brand names, who are supposed to understand this stuff, didn't really.
And I was just surprised. We got money from Insight Venture Partners. They are, of course, a giant firm; I think they're investing out of a $19 billion fund. They just operate at a different level. But, opposite to what we expected, they were actually technically more competent. We had George Matthew come in, Lonnie come in — they understood the technology and the market way better than the Bay Area VCs who are supposed to be all about this. I literally had top-tier firms — you know the names, I won't name them — come and say, oh, you are building this low-code thing.
Oh, you are competitive with Airflow, so we can't invest in you. And I'm like, where is Airflow and where is our product? In the seed round you go to a lot of small investors, which is what we did, and then the Series A is when you go to these top brand name firms. It was very hard to convince those firms of anything other than what's already there, and they didn't understand the ecosystem well. And I was surprised: our investor, George Matthew, understood everything we said in the first five to ten minutes, and it was like, wow, this is amazing. So I wonder how VC might get disrupted further by more competence coming into the market. That was a very, very interesting learning for us. Then there's the go-to-market side and scaling the product; we've started having people use our public SaaS version.
So we are in Databricks Partner Connect — you can just click on it, and we open in another tab, already connected. Or you can go to app.prophecy.com and just start using the product. For that, we've had to build the operational muscle of being able to keep the service up at all times with high quality, and now we've also got to go for all the compliances and such. So that's another journey, but those have been interesting. The other thing is right after the Series A: we got a big Series A, $25 million, at the upper end of the market.
And now there's the planning of how to organize and how fast to go. That's an interesting challenge; there are just so many decision points and so many things you could get wrong. It's been, do I hire a VP of marketing and a VP of sales, or do we build those functions bottom up? Do we build them top down? Can I hire a VP of engineering, and will that VP of engineering be able to sell? We can talk to some senior executives from Uber and other companies for VP of engineering, which we still don't have, but they've always had an organization feeding candidates to them and a big brand name. Will they be able to hire good quality candidates without that? I think the interaction with venture capital was interesting, and building and growing an organization and being able to plan has been good. The other thing I think we've done well is to continue to narrow our focus.
Right now, everybody owns one thing in our organization, something that was reaffirmed recently by Slootman's book — he's the CEO of Snowflake. We've gone with very fast speed and one priority for everybody. It's been very, very interesting so far, and there are a lot of new muscles to build very quickly. I guess all startups go through that. It starts as engineering, but then it's shipping to production, and then it's the data engineers in the field onboarding customers, and the marketing getting the word out. You have to build all this machinery very, very quickly.
Every new area is something you don't know, and some decisions are hard. So it's been a mix. Sometimes you just go hire somebody. Like, we got Sushant — Sushant has been in the industry for 26 years, he's our VP of marketing. He grew up in LA, went to Stanford, has been doing marketing for a while, and is a strong product marketer. Sometimes you are saying, we definitely need that role. And sometimes it's, do I figure out how to hire a UI person and a UX person and a product manager? No, we're not going to do that. So I just picked up Figma, got one designer, and both of us did a full UI overhaul, because that is quicker than figuring out how to hire for those four different roles. It's been quite an interesting and exciting journey. Earlier we were always running out of money and in survival mode.
And now, after the Series A, things have become a lot more fun. So definitely
[01:05:37] Unknown:
a very interesting experience. I recommend as many people as possible go through this; it's been a joy. So are there any other aspects of the work that you're doing at Prophecy, or this overall question of developer expressiveness and performance optimization versus developer experience and accessibility to data workflows at the organizational level, that you'd like to cover before we close out the show?
[01:06:01] Unknown:
I think I would like to just summarize, and this is a point of view that we now strongly believe; I mentioned it in passing earlier. We are looking at low-code data engineering, and while we're the ones talking about it today, I can be pretty certain that three years from now everybody will buy it. These things change — before Snowflake, everybody was using Hadoop. What we believe is that low code is just 10x better than visual ETL tools and 10x better than coding, because it has the best of both worlds. It gives you so many more users quickly building pipelines, reusing business logic, standardization.
A new person can come in and learn very quickly. You get metadata and search and lineage — a complete product. You get all of that. But at the same time, you have your code on Git, infinite extensibility, and Git and CI/CD best practices for robustness. And when you put that combined approach we've developed up against either code or visual ETL, you can just see that it is so much superior. We kind of stumbled into it piece by piece, solving customer problems, but as we now look back, it's like, okay, this is just a better approach, and one we are very, very excited about.
I would love to connect back after a year or two, and hopefully there will be a lot more of this approach in the industry.
[01:07:32] Unknown:
Well, for anyone who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:07:49] Unknown:
In the tooling and technology for data management today, I would say the management layer, or the metadata layer, or the information layer, is just not there. There are a few startups doing it, but it's not quite there. If you look at the cloud, there's technical information, but do I have business definitions? Are they flowing? So in the data tooling, it's the information layer. Let me give an example. Let's say I say, Tobias, I want your team to focus on data quality. I don't think we're doing very well on data quality; let's focus the next two months on data quality.
Do we even know what we're going to fix? And after you fix something, can you come back and report what you fixed? What you are missing is a place where you can understand your data engineering — your performance metrics, your quality metrics — and then report on it all. That's the basic layer where I can say, okay, we invested two months in data quality, we have 80% coverage on the datasets for data quality, within that we found these datasets were at this level of quality, and we've improved these ones to that level. I've got to be able to report on it. So that business-level metadata is missing. The Hive metastore is there, but that's just technical metadata for, say, Spark, or for Glue. The question is that no cloud has a good metadata product that allows me to merge this technical metadata with metadata from other sources: execution context, business context, cost context.
You know, put all of it together and be able to report on it and take actions on it. So I think that whole layer is just missing from the stack. You know, we haven't gotten to it, but it just seems like a big missing piece in the stack.
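As a rough illustration of the kind of merge being described, here is a small, purely hypothetical Python sketch that joins technical metadata with business, quality, and cost context into a single reportable record; every source, field name, and value is an assumption.

```python
# Purely hypothetical sketch of merging technical metadata with business,
# quality, and cost context into one reportable view per dataset.
# All field names and example values are invented for illustration.
from dataclasses import dataclass


@dataclass
class DatasetReport:
    table: str
    owner: str
    business_definition: str
    quality_coverage: float  # fraction of columns with quality checks
    monthly_cost_usd: float


technical = {"sales.orders": {"owner": "data-eng", "columns": 24}}
business = {"sales.orders": "All confirmed customer orders, net of refunds"}
quality = {"sales.orders": 0.8}       # e.g. 80% of columns covered by checks
cost = {"sales.orders": 312.50}

reports = [
    DatasetReport(
        table=name,
        owner=meta["owner"],
        business_definition=business.get(name, "undefined"),
        quality_coverage=quality.get(name, 0.0),
        monthly_cost_usd=cost.get(name, 0.0),
    )
    for name, meta in technical.items()
]

for r in reports:
    print(f"{r.table}: {r.quality_coverage:.0%} quality coverage, "
          f"${r.monthly_cost_usd:.2f}/month, owner={r.owner}")
```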
[01:09:43] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share your experiences and insights from building the Prophecy product. It's definitely a very interesting tool and a valuable approach, and I appreciate the time and energy that you and your team are putting into it. I hope you enjoy the rest of your day. Thank you so much, Tobias. A delight as always to talk to you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Welcome
Guest Introduction: Raj Bains
Raj Bains' Journey into Data Engineering
Categories of Data Tools and Their Impact
Organizational Productivity in Data Engineering
Prophecy: Evolution and Extensibility
Addressing Resistance to Low Code Solutions
Future of Data Engineering and Tool Integration
Innovative Uses of Prophecy
Unified Data Teams and Organizational Collaboration
Challenges and Lessons in Building Prophecy
Summary and Future Outlook