Summary
The modern data stack is a constantly moving target, which makes it difficult to adopt without prior experience. To accelerate the time to deliver useful insights at organizations of all sizes that are looking to take advantage of these new and evolving architectures, Tarush Aggarwal founded 5X Data. In this episode he explains how he works with these companies to deploy the technology stack and pairs them with an experienced engineer who assists with the implementation and training so they can realize the benefits of this architecture. He also shares his thoughts on the current state of the ecosystem for modern data vendors and trends to watch as we move into the future.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Your host is Tobias Macey and today I’m interviewing Tarush Aggarwal about how he and his team are helping organizations streamline adoption of the modern data stack
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you are doing at 5x and the story behind it?
- How has your focus and operating model shifted since we spoke a year ago?
- What are the biggest shifts in the market for data management that you have seen in that time?
- What are the main challenges that your customers are facing when they start working with you?
- What are the components that you are relying on to build repeatable data platforms for your customers?
- What are the sharp edges that you have had to smooth out to scale your implementation of those systems?
- What do you see as the white spaces that still exist in the offerings available for the "modern data stack"?
- With the rapid introduction of so many new products in the data ecosystem, what are the categories that you see as being a long-term necessity?
- What are the areas that you predict will merge and consolidate over the next 3 – 5 years?
- What are the most interesting, innovative, or unexpected types of problems that you and your collaborators have had the opportunity to work on?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building the 5x organization?
- When is 5x the wrong choice?
- What do you have planned for the future of 5x?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- 5X Data
- Snowflake
- dbt
- Fivetran
- Looker
- Matt Turck State of Data
- Mixpanel
- Amplitude
- Heap
- BigQuery
- Narrator
- Marquez
- Atlan
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use. With Select Star's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets.
Try it out for free and double the length of your free trial at dataengineeringpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. Your host is Tobias Macey. And today, I'm interviewing Tarush Aggarwal about how he and his team are helping organizations streamline adoption of the modern data stack. So, Tarush, can you start by introducing yourself?
[00:01:53] Unknown:
Hey, Tobias. Firstly, thank you so much for having me on the show again. Really looking forward to being here and hopefully adding some value to your audience. Happy to sort of introduce myself. You know, I've spent my career in data. I sort of started being one of the early engineers on the analytics team at Salesforce. And most recently, I worked at WeWork, where I got to help scale the data team, you know, the sort of data organization. We supported 10 to 1,000 employees. And, basically, I found myself in Bali at the beginning of COVID and started a company called 5X.
[00:02:24] Unknown:
And so you mentioned that you got your start working with Salesforce and that you helped to scale the data team at WeWork. I'm wondering if you can just share what it is about the overall ecosystem of data and that problem space that keeps you interested and keeps you working in it.
[00:02:44] Unknown:
When I got started, I sort of started as a software engineer, and very quickly, I realized it wasn't for me. You know, the whole idea of working on a particular feature, you know, for an extended period of time, wasn't that interesting. I was a lot more interested in, you know, how do you enable the business to actually hit what the business wants to achieve. And, you know, I really found that with data, you know, you get to sort of get a bird's eye view of what's happening in the business, but also be able to zoom down and really sort of focus on, you know, what's needed in order to sort of drive change. And, you know, when we sort of got started 10 years ago at Salesforce.com, the, you know, infrastructure needed just wasn't there. You know, we were doing a lot of this stuff ourselves.
And one of the things which the modern data stack movement has really enabled is, you know, companies of all sizes to be able to ultimately bring some of the value which companies like WeWork and Salesforce can deliver at scale. And I'm sort of very focused on solving that problem of how do we make it easy for the 95% of businesses that are not WeWork and Salesforce, which can deploy an army of engineers. How do we solve the sort of problem of helping them focus on business outcomes through sort of data and analytics?
[00:03:58] Unknown:
And so that brings us to the work that you're doing now with 5X and helping organizations adopt the modern data stack. I'm wondering if you can share a bit more about some of the origin story and what it is that you are building at 5X and some of the ways that your focus and operating model have shifted since the last time we spoke about a year ago?
[00:04:20] Unknown:
Yeah. Gosh. I know we were just talking about it. It's been a year, and everything seems sort of super different. So, you know, just sort of setting some sort of background. Right? Like, the amount of data companies are collecting is increasing and increasing. We sort of read somewhere that the average sort of startup today has got between 10 and 12 sources of data. And, you know, while data is increasing, it's becoming increasingly harder to extract business value from data. And if you think about it, the modern data stack is this huge movement, and we have, you know, all these amazing vendors like Snowflake and dbt and Fivetran and Looker. But the stack is also extremely fragmented.
You have, you know, a different best-in-class vendor for every layer of the stack. And, you know, this kinda starts from data collection, ingestion, storage, modeling, reporting. You have data lineage. You have decision making. You have reverse ETL, machine learning, experimentation. I just named, like, 10 layers of the stack. And if each of them has its own vendor, it's, you know, very, very difficult for a company which is not WeWork or Salesforce to be able to go sign all these enterprise contracts and sort of stitch it together. So, you know, what we're really up to is how do we work with all these best-in-class vendors and sort of pre-stitch them, you know, figure out the entire enterprise contract piece, and allow anyone to, you know, go on to 5X, enter their credit card, and five minutes later, they have the modern data stack all configured, all ready to go so that they can kind of focus on the business outcomes instead of having to worry about how do you set it up or instead of worrying about sort of signing these contracts.
And I think, you know, when we were speaking a year ago, you know, where we were was this best practice on what this data stack is and how to set it up. And at that point, we were really going into companies and sort of teaching them how to set it up in a way which is gonna really streamline operations and, you know, make it much easier for them to sort of scale and grow. Sort of what we realized after speaking to a whole bunch of customers is that, you know, every company as they scale encounters this problem where they need this stack. They need to sort of set it up and get value from it. So, you know, we were on the right track over there. It turns out that 99% of these companies want you to do it for them instead of learning how to do it. So what really changed is instead of teaching them how to do it, we sort of decided to go build this platform so that everyone can get this core stack out of the box. And then we also started a sort of marketplace where we can actually hire, you know, amazing data engineers and train them on the best practices on the data stack, on, you know, how to get set up, and then allow companies to, you know, embed these engineers sort of directly into their business.
So, you know, moving from training companies to, you know, basically providing the data platform as a service and then training the engineers so that there's a lot more standardization at the engineer level and then allowing companies to use the platform and embed these engineers so that they can ultimately drive these business outcomes.
[00:07:34] Unknown:
As far as the modern data stack, you listed a few different vendors, and I know that there's been a lot of debate and conversation about what the term means, and there are some very philosophical views about it. I'm wondering if you could share your perspective on what you actually mean when you say the term modern data stack.
[00:07:55] Unknown:
You know, I think, again, 10 years ago, none of this really existed. A lot of this, at some point, sort of got started with, you know, the sort of big data revolution and kind of, you know, the Hortonworks and sort of Clouderas. And, you know, Hadoop and Spark and all of these kind of tools came out of it. And I think what's kinda become increasingly clear in the last few years is there's been, like, multiple sort of technologies which have really, you know, started to become mainstream, and the data warehouse is sort of one of them. And, you know, with the data warehouse, you know, came this sort of ecosystem of tools around the warehouse, like, you know, ingestion providers like Fivetran and then modeling on top of the warehouse. You know, the advent of basically sort of SQL at scale sort of began this, you know, what I refer to as the sort of modern data stack movement. And what's become more clear is that, you know, technologies like Spark and sort of Hadoop have kind of become more of, like, niche use cases where, you know, they're still sort of relevant. You know? They're still extremely good at certain use cases. But for, you know, the majority of companies, for the majority of use cases, using something like the data warehouse has become more sort of mainstream.
And, you know, kind of what I refer to is, you know, the sort of modern data stack. It's, you know, the sort of tools and technologies around the sort of data warehouse and sort of that ecosystem. Does that make sense? Not to say that, you know, things like Kafka or, you know, things like sort of Spark and Hadoop are sort of not relevant. They're extremely relevant, and, you know, they start kind of playing well into this ecosystem at some level. But, you know, this is, like, a core kind of ecosystem around the warehouse, and I think that's really, you know, what we have sort of referred to as the modern data stack.
[00:09:31] Unknown:
Yeah. Absolutely. And that's the common thread that I've seen in most of the conversations I've had about the modern data stack is that it is reorienting around the data warehouse as the focal point of everything having to do with data.
[00:09:45] Unknown:
I so totally agree. And, you know, I think sort of Snowflake's IPO in some ways, you know, made that fairly clear. Right? The sort of largest IPO in tech history happened to be a sort of data company. I think that's sort of setting the path for the next 5 years. And if you, you know, just look at something like Matt Turck's State of Data 2021, he's got this sort of massive diagram of, you know, what are all the different categories inside the data space. And if you start looking at how many of them kind of are, most recently, you know, on top of the warehouse or on top of technologies powered by the warehouse, it's kind of becoming clear that, you know, that's the direction we're sort of heading in.
[00:10:35] Unknown:
In this time from when you first started working on 5X Data and had this model of being an enabler and helping organizations build their own internal capacity for being able to manage advanced analytics to where you are now of helping to sort of reaggregate the modern data stack and provide services and capabilities on top of that. What are some of the biggest shifts that you've seen in the market for data management and some of the ways that organizations are approaching it?
[00:11:03] Unknown:
I think the storyline is getting more and more clear. Right? Like, where companies begin, you know, they might go try to assemble these pieces themselves. They might start out with a BI tool and a warehouse. Increasingly, as they sort of mature, right, bringing in the right ingestion layers, bringing in the right sort of modeling layers. So, you know, we're starting to see that playbook sort of happen over and over again. And, you know, the evolution of that playbook is, you know, at some point, you require an additional set of tools. Right? Whether it's starting to collect a lot of datasets, so you sort of really care about data lineage or sort of data catalog, or you wanna start pushing this sort of intelligence back into some of your application systems or your marketing tools with things like reverse ETL. So, you know, what we have seen is that the core stack is starting to be more and more, you know, well understood and sort of finalized, and we're starting to see the use cases around the edges. But, also, you know, from a security perspective, from compliance, from GDPR, from PII, you know, what we're seeing is that a lot of companies are adopting this stack and then having to kind of figure out these pieces on top of it. So, you know, it's just a matter of maturity before companies need to start figuring out how are they gonna deal with PII information, whether it's gonna be encrypted. You know, how do they deal with the GDPR process and making sure that, you know, the data ecosystem's compliant. You know, for the companies inside regulated spaces, you know, things like governance and things like audit are sort of relevant. Right? And, you know, to some extent, because a lot of these companies set up these tools in their own way, they are really building a lot of these sort of security guardrails and access control and governance also in their own way.
There are a few platforms which do it, but, you know, for the most part, they don't integrate into the sort of modern data stack sort of space. Right? So, you know, one of the things we've seen is that, like, the vendors are sort of fairly standardized.
But, again, because there's no standardization in how these vendors play together, how they're set up, you know, how do you do end-to-end security and sort of user permissions, companies have to go figure out this sort of security layer on top. And I think for us, you know, there's this huge advantage where we can give, you know, sort of companies a sort of flexibility of, you know, adding different vendors in the space, where we can offer, you know, sort of holistic products around, you know, these sort of areas, which are sort of more standardized, which, you know, an early stage company doesn't need to care about. But as a company starts to mature, they can sort of just turn these products on.
[00:13:40] Unknown:
For the organizations that you're working with in your current formulation of the problem that you're solving, what are some of the general trends that you see in terms of the size or scale of the organization and the particular challenges that they're facing that they're looking for assistance with being able to adopt these new capabilities offered by the modern data stack?
[00:13:53] Unknown:
Yeah. A few sort of trends stick out. And I think number one is, you know, when companies start to invest in data. You know, the companies using us at the moment are sort of fairly broad. You know, our smallest companies are preproduct, and our sort of largest companies are, you know, public tech companies. So, you know, at the very early stage, we're seeing more and more companies start to, you know, really invest in this even preproduct, just knowing that, you know, when they do wanna launch, they wanna have the right models in place. They wanna have the right metrics. They wanna make sure that they have been sort of collecting the right data. So, you know, we're starting to see companies do this way earlier than sort of what we typically saw, which was, you know, a much later stage company. You know, the whole idea of being data driven was an expensive proposition.
You had all of these different tools, for which you would sign enterprise contracts. So, you know, arguably, you know, spend $100K before you even, you know, build anything. And then, you know, hiring data engineers and data platform engineers and analysts, you know, was in general a pretty expensive proposition. You know, it sort of used to cost companies half a million dollars just to get started. You know, with a lot more self-service and, you know, with sort of what we do, the sort of price to do this has become exponentially cheaper. So we're seeing companies do this a lot earlier on. The other trend which we're seeing is that, you know, the hybrid model of working with, you know, offshore engineers is just gonna kind of become a lot more popular. Right? Like, we're seeing companies who have 10, 15 open heads, especially, you know, minus the 5 or 10% of pure tech Silicon Valley companies, which will, you know, pay extremely high salaries and kind of have offices in, you know, all the tech havens. You know, if you remove that, you know, 10% and you focus on the other remaining 90% of companies, which don't have the luxury of doing that. You know? I think in general, we're living in an engineering drought for the next 15 years. There are not gonna be enough engineers. And when you narrow that down into data, that sort of gets amplified. Right? So, you know, we're starting to see more and more of that hybrid playbook of how can we focus on, you know, areas which are gonna impact the business more and more, like data science and sort of data analytics? And how can we offload a lot of the data platform and the core data engineering, ingestion, modeling, structuring, you know, sort of basic BI to teams which are just a lot more efficient in doing that? These are two of the trends which, you know, we have seen in sort of being around in this new model for about eight or nine months now.
[00:16:30] Unknown:
As far as the underlying technologies that you're focusing on for customers that come to you, what are the primary building blocks that you are using to help empower these customers to build out their analytical capabilities? And what are some of the pieces that they're typically coming to you with already in place and just some of the ways that those two map together where you have this set of technologies that you will default to in a greenfield environment and the types of technologies that organizations are coming to you with and just how you're able to mesh them together?
[00:17:06] Unknown:
You know, in some ways, a lot of organizations might have, you know, some pieces of it. You know, I think the five core layers, and not all of these are sort of relevant for every company, but starting off with data collection where you have your, you know, Mixpanels, Amplitudes, or Heaps, and then ingestion with, you know, something like Stitch or Fivetran, and sort of storage with, obviously, Snowflake, BigQuery, maybe Redshift, you know, modeling with dbt and sort of BI with, you know, the whole range of tools. You know, if you kind of look at these five as really the core stack and then, you know, maybe data lineage, reverse ETL, sort of decision making, machine learning as, you know, the fringe kind of on top of it. You know, a few of the folks we work with kind of might have, you know, a BI tool and a sort of storage layer. You know, the stack which we really push for is, you know, Fivetran, Snowflake, dbt; that's really our sort of core stack. If someone is using, you know, an Oracle database or sort of Amazon Redshift, we don't really have any problems with that. But you won't be able to kind of plug into our ecosystem because we build a lot on top of, you know, Snowflake. At some point, we'll probably add BigQuery over there. But, really, you know, Fivetran, Snowflake, dbt sort of out of the box is really, you know, our core stack. Increasingly, more and more customers are, you know, on Snowflake, and, you know, it doesn't matter what data collection or BI tool you use. They kind of play into this. So, you know, what we've seen is that for a lot of companies, we're filling in the blanks if a few pieces are missing.
If they're on a completely different stack, we've sort of got really good at sort of migrations, of kind of how you move to this core stack. And now increasingly, you know, being able to, you know, integrate with, you know, tools like reverse ETL or sort of decision making with something like Narrator.ai is kind of also, you know, more and more companies are just doing this much sooner. So, you know, I think that stack is, in general, becoming broader and broader.
[00:19:05] Unknown:
As far as that integration process of being able to tie together these different technologies, what are some of the sharp edges that you have run into and some of the ways that you've approached sort of smoothing that for your end users to give them a more pleasant and holistic experience of these disparate technologies?
[00:19:26] Unknown:
Yeah. So, you know, I think, number one, you know, increasingly, companies are having more and more PII and need a sort of PII strategy. Right? Like, you know, we have a bank using us today and, you know, other financial institutions, and you don't wanna share, you know, someone's credit card details or, you know, any of that financial information. Right? So, you know, how does that kind of process work? Right? And today, we use sort of mass encryption at the sort of column level. So, you know, we can pick up PII fields upfront. And, you know, when the data goes into 5X, when the data goes into Snowflake, it's sort of automatically encrypted, and you only have your security admins that can actually view this. Right? So, you know, that's kind of one sort of sharp edge, which, you know, increasingly is kind of becoming more and more relevant. I think the other one, you know, which is sort of what we're looking at right now, is really the sort of data lineage, sort of data catalog piece. Right?
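A minimal sketch of the column-level idea described above, assuming a hypothetical set of PII column names and a hypothetical `security_admin` role. It is not 5X's or Snowflake's actual implementation, and it uses one-way hashing as a stand-in for whatever encryption the platform applies:

```python
import hashlib

# Hypothetical set of column names flagged as PII up front (not from the episode).
PII_COLUMNS = {"email", "credit_card"}


def mask_value(value: str) -> str:
    """Replace a value with a truncated SHA-256 digest (one-way masking, not reversible encryption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def view_row(row: dict, role: str) -> dict:
    """Return the row as the given role would see it; only 'security_admin' sees raw PII."""
    if role == "security_admin":
        return dict(row)
    return {
        column: mask_value(str(value)) if column in PII_COLUMNS else value
        for column, value in row.items()
    }


row = {"name": "Ada", "email": "ada@example.com", "plan": "pro"}
admin_view = view_row(row, "security_admin")  # raw values
analyst_view = view_row(row, "analyst")       # email replaced by a digest
```

In a real warehouse this kind of rule would more likely live in the database's own masking or access-control features than in application code; the sketch only illustrates the split between flagged columns and privileged roles.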
What's happening as companies are getting more data literate is that the number of datasets inside a company is sort of growing exponentially. And, you know, I think one of the massive pieces which was always missing in the data stack was this concept of sort of, you know, having a sort of catalog and sort of lineage display. Right? So, you know, we're really, you know, integrating this in right now. We want it to be part of our core stack. Right? So, you know, when you get it out of the box, we sort of talk about Snowflake, dbt, and sort of Fivetran. But, you know, potentially having, you know, something like a sort of Marquez in there or, you know, at some point, something like an Atlan in there is an edge case which we're seeing increasingly more, which is probably one of our biggest focuses in terms of what's coming next into our platform.
[00:21:00] Unknown:
Yeah. I was gonna ask next about sort of what you see as being the main white spaces in what most people regard as the modern data stack and just some of the missing pieces that need to be more prevalent or become sort of part of the vanguard of technologies that everybody adopts out of the gate versus coming to later on in their maturity cycle?
[00:21:24] Unknown:
I think the biggest one out of the box would be around metadata management, which is a broad topic, but sort of lineage and catalog. And I think reverse ETL is, you know, we spent the last 5 years making sure that we have reporting outside of application tools. So, you know, you have information in sales calls and marketing tools, and, you know, users have to use Looker or, you know, sort of Tableau or sort of whatever you're using, you know, to look at sort of dashboards there. And finally, you know, going back to the idea of, hey, actually, you should be able to see reports back in the tools which your teams are actually using. Right? So, you know, we always saw that as a sort of white gap in the sort of stack. So I think that's sort of another area. And I think, lastly, going back to the sort of security conversation. Right? Like, you know, what starts to bite companies very quickly as they scale is, hey, we need to, you know, have a PII posture, or we need to be able to do audit or compliance, sort of any of those security elements. And then just realizing that you've set up everything in a way which wasn't really built in order to, you know, have different roles and governance and all of this. Right? So, you know, it's a huge opportunity for us to be able to now provide, especially on the security side, that whole part of the business as a sort of product on top of the stack.
[00:22:45] Unknown:
In terms of the work that you're doing at 5X, because you are relying on these underlying technology platforms, there is a lot that you can take off the shelf. But what are some of the pieces of engineering that you've had to do to be able to present all of this in a unified manner and be able to have a single interface for your customers to be able to access all of these various systems?
[00:23:11] Unknown:
We're still working on sort of making all of this sort of self-service. Right? Like, you know, at the moment today, we sort of bundle this sort of platform as well as the engineer because it's not fully automated. We have to kind of go do some manual stuff. Right? So it has to come bundled with the engineer. A few months from now, you'd be able to go to 5X, enter your credit card, and, you know, have the stack running in, you know, a few clicks, which will be sort of completely self-service. If you dive into this, you know, this is not just a technology problem. This is not just, you know, how do we go design a stack and sort of spin it up. A lot of it is also around the sort of relationships because, you know, a lot of these tools haven't been built in a way where, you know, we can sort of provision everything from scratch. We have to have a conversation with these vendors. You know, with the sort of Fivetrans and Snowflakes, it's a little bit easier. But as we get into some of, you know, the other vendors, which are not quite as big, they have to start to sort of open up some APIs, basically, for us to be able to go do this, which, you know, sort of takes time. So that's kind of one big piece of, you know, where we're going now. And, you know, I think in the next few months, we sort of will finally be at the point where we'll be sort of pretty self-service.
And the other piece around it is, you know, think of it in some ways like the App Store. Right? Every iPhone today comes with calling, texting, and a web browser. In the same way, if you think about using 5x, everyone will get ingestion, storage, modeling. And then for all of the other stuff, like, no iPhone comes with a banking app because that's so personal to you. Same way with a BI tool: depending on the stage of the business, how much money you wanna spend, how many users you have, you might pick between one of 10 different tools out there, all of which might have their own pros and cons. So we do wanna have a store where we can work with any BI tool. Or in reverse ETL, if it's Census and Hightouch, you know, at some point, we might start with one. But hypothetically having both in the store and then letting customers choose what makes sense for them in terms of pricing, in terms of feature set parity, and all of these areas. Right? So we have some opinion on the general architecture of the stack. We've gone with the data warehouse architecture. We don't do Spark or Hadoop out of the box. We're not doing Kafka. We've gone with what we call the modern data stack. So we have a directional opinion, but we wanna be able to give customers the choice of picking what vendors make sense for them and figuring out how you stitch it together in a mature way with security and all of these pieces in mind.
And apart from making it very, very easy for businesses to get on the data stack movement, we also believe this is great for the vendors because a lot of these vendors are only focusing on the mid market to enterprise. Whereas if a customer today doesn't have a data engineer, it's very difficult for them to go use Snowflake. You're required to have technical expertise to use any of these vendors in the space. I think what 5x, with a sort of managed data stack or data platform as a service, can do is create this opportunity where, if you have the entire ecosystem set up, even a software engineer could go there and start operating it. So we're reducing the barrier for companies to use the modern data stack. And we think the vendors, the ones who get on board, are gonna really benefit from this. It becomes a new channel for them for user acquisition.
[00:26:44] Unknown:
And to your point of enabling software engineers to be able to own the data experience and analytics workflow, one of the challenges that often comes up with that is that there are very specific considerations around data modeling and long term sustainability of these efforts that at first glance seem like you don't need to spend a lot of time on them. But as you scale the utility and scale the number of people who are relying on these systems, it does require much more specialized knowledge about some of the ways that data is its own sort of unique experience. And I'm curious what you see as some of the opportunities for the vendors to be able to help automate or mitigate some of those aspects and some of the ways that you're working with your customers to help educate them on those requirements?
[00:27:33] Unknown:
Yeah. Totally. That's a great question. And, you know, in some ways, we actually find it easier if a company has a data team of one or two people and they've started to encounter some of these things themselves. It makes it easier for us to explain that, hey, there is a best practice in how you do it. Just starting with a simple example, a lot of these companies, when they get started, they're building a lot of dashboards on top of their raw data. And what that results in is a massive fan out problem. When you change one business metric, you have to go change it in 50 different places. And that's something which, when you're starting out, you don't think about. But when you've gone into an implementation, you realize how complicated it is to get out of that. Right? Having an actual business layer which is modeled on top of your raw data. Right? So, you know, that's one example of it. I think, in general, it's this concept of being able to pretrain these engineers on what the best practice is for how to go operate the stack and how we wanna do data modeling.
And another use case which we experience very often is this whole concept of single source of truth, right, where people start to edit data inside the warehouse itself, and all of a sudden, numbers don't match universally. Right? So, often enough, businesses have pretty complicated use cases. And it's about being able to train our engineers on how to spot these things, what the best practice is, being able to stick to a single source of truth, and how you architect around it in a way that still solves the business use cases of having different systems and is able to reflect accurate data. We solve a lot of this problem by having a pretty comprehensive training program.
And what's pretty cool about the training program is, you know, everyone is trained on the core stack. But assuming a company now adds one of our vendors on it, assuming they add decision making with a tool like Narrator, as soon as the company adds that platform onto the stack, the engineer gets an update saying they have a new training module, which they need to go ramp up on. So as soon as the new vendor is on the platform, the engineers have been pretrained on the tool, and they understand the best practices on how to go set up that tool. So because we control the platform, we can really provide core expertise on how you use this vendor on top of the platform in a way which is gonna ultimately serve you in the long term.
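Editor's note: the fan out problem described in this answer can be illustrated with a minimal sketch. It assumes a hypothetical `raw_orders` table and a hypothetical `fct_net_revenue` view standing in for the "business layer"; neither name comes from the episode, and SQLite is used here only so the example runs without a warehouse. The point is that the metric (net revenue excludes refunded orders) is defined once, and every dashboard query reads from that single definition.

```python
import sqlite3

# In-memory database standing in for a warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical raw table, as loaded by an ingestion tool.
    CREATE TABLE raw_orders (
        order_id INTEGER,
        region   TEXT,
        amount   REAL,
        refunded INTEGER
    );
    INSERT INTO raw_orders VALUES
        (1, 'EU', 100.0, 0),
        (2, 'EU',  50.0, 1),
        (3, 'US', 200.0, 0);

    -- Hypothetical business layer: the metric is defined exactly once here.
    CREATE VIEW fct_net_revenue AS
    SELECT order_id, region, amount AS net_revenue
    FROM raw_orders
    WHERE refunded = 0;
    """
)


def revenue_by_region(connection):
    """Dashboard query 1: reads the modeled view, not the raw table."""
    rows = connection.execute(
        "SELECT region, SUM(net_revenue) FROM fct_net_revenue "
        "GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)


def total_revenue(connection):
    """Dashboard query 2: reuses the same metric definition."""
    return connection.execute(
        "SELECT SUM(net_revenue) FROM fct_net_revenue"
    ).fetchone()[0]


if __name__ == "__main__":
    print(revenue_by_region(conn))
    print(total_revenue(conn))
```

If the definition of net revenue changes, only the view changes; both dashboard queries pick up the new definition without being edited, which is the opposite of changing it "in 50 different places".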
[00:30:09] Unknown:
Today's episode is sponsored by Prophecy.io, the low code data engineering platform for the cloud. Prophecy provides an easy to use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all the data users can use software engineering best practices: Git, tests, and continuous deployment with a simple to use visual designer. How does it work? You visually design the pipelines and Prophecy generates clean Spark code with tests and stores it in version control, then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage.
Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy.
As far as the overall ecosystem of data tools and platform components, as somebody who is spending so much time and focus on the modern data stack, I'm curious what you see as the pieces that are going to have some longevity and continue to be core requirements of the ecosystem and of a healthy platform?
And what are some of the categories of tools or vendors that you see as being ripe for consolidation where maybe the data catalog space is going to subsume the, you know, lineage providers or maybe the data warehouses are going to build in their own lineage and catalog capabilities to obviate some of these dedicated platform tools.
[00:31:51] Unknown:
Taking a step back and, for a second, switching to the business. Right? What are the use cases businesses really care about? Broadly speaking, it's around 3 areas. Number one is go to market strategy: how do we attract more customers? The second is how customers are using our product, which is mainly around how we increase engagement and how we decrease churn. And lastly, how do we optimize our business, decrease cost, automate some of the internal processes so we don't have to put people over there. Right? So, thinking about it from this lens, these 3 areas are why a business is doing data. No one's doing data for the sake of building a data engineering team. They're doing it because it has to go optimize the business. Getting more customers or more money, making sure you're keeping your customers, and decreasing your costs are ultimately the large reasons why people are doing this. Right? And now, with this lens, flipping this back, I think about the areas which have really become more mainstream, especially in the last year. When you look at something like data lineage, it's about ensuring that you know what your golden datasets are in the company, and knowing, when a job fails, what the downstream dependencies are so that you can be proactive and prevent the business from using stale data. Or when it comes to something like reverse ETL, you're able to ultimately enable some of these teams, be it marketing or sales or any of these other areas, on what they're looking for. You know, I think the core stack, with ingestion and modeling and storage, was really a lot more at the infrastructure level.
And now, with lineage and reverse ETL and areas like decision making or even data mesh, I think these are a little bit more in line with the business use cases, with what is needed in order to really drive some of the value for the business. So I think these layers are becoming more and more mainstream.
And I think, you know, AI and sort of data products and, like, metric stores, we haven't really, you know, integrated any of this stuff into our platform. But I think those are some of the sort of areas which I think will sort of very naturally be the next core parts of the stack.
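Editor's note: the lineage use case mentioned in this answer (knowing, when a job fails, which downstream assets are affected) comes down to a graph traversal. The sketch below is illustrative only: the `LINEAGE` mapping and every asset name in it are hypothetical, and a real lineage tool would derive the graph from query logs or pipeline metadata rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the assets listed as its value.
LINEAGE = {
    "ingest_orders": ["stg_orders"],
    "stg_orders": ["fct_revenue", "dim_customers"],
    "fct_revenue": ["revenue_dashboard", "crm_sync"],
    "dim_customers": ["crm_sync"],
}


def downstream_of(node, lineage):
    """Return every asset reachable from `node`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # skip assets reached via another path
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    # If the stg_orders job fails, these assets may be serving stale data.
    print(downstream_of("stg_orders", LINEAGE))
```

With a list like this in hand, a team can flag the affected dashboards and syncs before the business reads stale numbers from them.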
[00:34:21] Unknown:
In your work with these organizations and helping them come to grips with the modern data ecosystem, what are some of the points of confusion or areas where you've had to focus on education to help them understand what the benefits are, what the risks are, what the trade offs are of using a fully vertically integrated solution, and just understanding when the modern data stack and the work you're doing at 5x is the right choice for them?
[00:34:52] Unknown:
I think it's around 2 areas. Right? I think, actually, people are understanding more and more that, you know, a vertically integrated stack and having different vendors is the right way to go. And I think a big reason for that is, if you look at the consumer space, we're seeing that. Right? People wanna use Slack for messaging and they wanna use Zoom for video. Gone are the days where you're using one platform which does everything end to end. Right? So I think people really do resonate with, let's use the best vendor for one use case and integrate a bunch of them. And I think what we still see is the "we wanna go do data science" point of view, where we have companies which have relatively little or no data experience talking about hiring data scientists and more interested in what their ML and data science capabilities are instead of really looking at it like a pyramid. We're speaking on podcasts and doing a lot more blogs and YouTube videos on the education of, hey, this is like a pyramid, and it's like building a skyscraper.
And this data science and ML stuff is like the penthouse on top of it. And if you just throw down cement and concrete, you can go 5 stories up, but it's all gonna come crashing down. If you're serious about building a skyscraper, you start by digging into the earth for some time instead of going up. So I think that's really one big piece of the stuff which we do around education. And I think the second one is largely on, why do we need all of these different tools? Why can't I just hire an analyst today and go build a dashboard? Why can't I do this in steps? Like, why can't I go add dbt into the stack 6 months later?
And why are we automating ingestion so early on? And it's interesting because in the software engineering world, you're pretty used to building something which works for now and, 6 months later, rearchitecting it. The problem with the data stack is that, 6 months down the line, you can't tell the business, hold on, I'll be back in 2 months, I'm gonna rearchitect the stack. You require business continuity. You know? Data migrations are way more difficult than software engineering database migrations because it's not just one data source. It's this entire ecosystem and how it all plays together. So it's a little bit tricky to teach companies that you actually wanna have the best in class stack on day 1, which is gonna help you scale for the next 3 or 4 or 5 levels, instead of incrementally adding pieces.
[00:37:32] Unknown:
On the marketplace side of your business where you're working with engineers to help the customers that you're working with, I'm wondering what are some of the skills and capabilities and useful background that you are looking for in people to be able to help grow your business and scale your customers' analytical capabilities, and some of the ways that you're approaching the onboarding of these engineers to be able to help with growing the analytical capabilities of these companies?
[00:38:08] Unknown:
Yeah. Absolutely. So, you know, we've built a fairly comprehensive process. Right? Most companies still hire based on sort of syntax based hiring. Right? Like, how do you solve this one particular use case, or what is the answer to this question? Right? And one of the first things we developed is a pretty automated process. Our core test is we give engineers access to 10 different data sources, and they have 6 hours to go ingest this, to go model it, to structure it, clean it, and build out reporting. Right? So it's very much the core of what a data engineer actually does. Right? They're responsible for 3 core areas: responsible for ingestion, responsible for modeling, responsible for BI. Right? So we test those core abilities at the very get go. We do psychometric testing to ensure that we can avoid red flags as much as possible. We're looking for someone with high independence, high communication skills.
We're looking for someone who works well with others but can also function pretty autonomously. That's kinda what we're looking for in engineers. And when we train them, there are only 2 parts to our training process. Right? The first one is just on our core stack, working with Snowflake, working with Fivetran, the best practices. We partner a lot with these companies and have used a lot of their training content to teach best practices. I think the second piece is on working with clients. Right? Implementing a sort of customer maturity matrix of, hey, when you have a customer who's extremely mature on the tech side, how do we work with them? And over there, we're talking a lot more about being able to integrate code into Git and being able to integrate into their project management tools. And how does that work compared to a much more early-stage customer, which for the most part wants us to lead a lot more, starting with a very, very generic business requirement, and how do we go from a super generic business requirement to a technical dashboard. Right? So we spend a lot of time teaching our engineers.
I wouldn't really call them soft skills, but, really, you know, what we have learned with sort of working with customers over the last 7 months and kind of how do we sort of start to, like, standardize more and more of this upfront.
[00:40:43] Unknown:
As you are working with these engineers who are onboarding, what are some of the challenges that they're dealing with either in terms of coming to terms with the technologies or figuring out how best to engage with the organizations to help them reach their goals and just some of the points of friction that exist at each of the different stages between the various players in sort of your experience that you're trying to provide?
[00:41:11] Unknown:
I think anytime you speak about people, it's a completely different set of problems from a platform perspective. Right? People are not products. People are human beings, and each of them is different. It's much, much harder to generalize. Right? So having a marketplace on top of a platform presents its own complete set of challenges, like being able to find the right engineer for a customer.
We're learning a lot over there on the whole matching process, on what we need to be able to demonstrate in the first few weeks, how these things evolve, how to keep folks accountable. You know? Interestingly enough, if I had to pick one area which is coming across both, it's engineers also asking us, when can we go start doing data science and deploy models and recommendations for these customers? And we have nothing against it. At some point, we would love to be more integrated in it, but we're realizing that for the majority of businesses, there's so much more value in just getting much, much, much better at the core stuff than attempting to do any hacky data science stuff at the moment. So it's more on how we commoditize, in some ways, a lot of the data platform and the data engineering stuff so that companies can ultimately focus more on the data science stuff. And that's really kind of what we're after.
[00:42:47] Unknown:
In your work with customers and with the engineers that you're engaging with and just the overall data ecosystem, what are some of the most interesting or innovative or unexpected types of problems that you've been approached with or that you've helped to find solutions for in the work that you're doing at 5x?
[00:43:07] Unknown:
There's so much stuff which we learn every day. Right? One of the companies which we're working with is very interested in being able to match leads to sales reps, to figure out, hey, instead of just randomly assigning them, is there a way we can maximize our conversion rates by identifying which sales reps are better at which sorts of leads. So that's a pretty interesting use case, which the teams have actually been looking into pretty recently. We've had other use cases where, by starting to work on a problem, the business actually realized that, hey, there's a whole new opportunity in the market to launch a new product which makes it infinitely easier to go search for information around a particular domain, because we were experiencing trouble effectively modeling that data from a data modeling perspective. So they realized that if they could solve this problem, this was a huge opportunity for them to take these products to market.
So, apart from a few of the customers which we've done case studies with, we're being a little bit more cautious about not getting too deep in terms of who these customers are and what they're trying to do. But it's been really interesting. Ultimately, we are really focused on adding business value for these customers. Again, going back to what I said earlier, no one's doing data for the sake of building a data platform or hiring data engineers. They all are doing it to optimize the business. Right? And the overarching reason why we exist as 5x is ultimately to make it easier for these companies to really add business value. So it's been really interesting to see how customers are using these capabilities which we've built to enhance their businesses.
[00:44:59] Unknown:
As somebody who has done a decent amount of consulting in the past, 1 of the things that's always challenging is being able to understand how to quantify the value that you deliver. And given that that is sort of the core focus and the core motivator for what you're doing, I'm wondering how you think about the conversations at the outset about, you know, what it is that you're going to provide to the customer and some of the ways that you measure success in the work that you're doing?
[00:45:28] Unknown:
Yeah. That's a great question. I think that'll keep evolving, but we don't call ourselves a consulting company. We don't do discovery calls. We don't provide statements of work. You know? We're basically providing you a platform and a marketplace to embed really high quality engineers into your business. Right? So we look at ourselves as a long term solution for your core data platform and data engineering needs. So we typically have customers list down their top business goals. Like, what is that use case which could potentially add a 0 onto your revenue? We pick one of them and scope that out, not a lot, and have the engineers really focus on a single use case to begin with. So, hey, how do we go set up the entire platform and then go focus on this one single use case in the first 4 to 6 weeks?
And from there, be able to take this to all parts of the business. You know, we've been fortunate enough that most companies are just really impressed by how quickly they start to get meaningful data. And a lot of that comes from the fact that we don't set up infrastructure from scratch, and our engineers are trained on how to get started out of the box. So, on week 1, once we have a deep web doc and kind of figure out what the first use case is, we're already ingesting data into the platform. Right? And by week 3, they're already starting to see dashboards and business output. So we've been fortunate enough that customers are seeing the value in what we deliver, especially the ones who might have had data teams in the past and know how long it took for them to even build their first dashboards compared to how fast it's taken us. So we haven't really done a lot of what's done inside the consulting world, a lot of work around, here's the value we added, and this is how much it would have cost. It's been more organic for us so far.
[00:47:26] Unknown:
And so in your experience of starting the 5x business and growing it and starting to bootstrap the marketplace ecosystem and working with your customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:47:41] Unknown:
It's been really interesting to work with companies across all of these different industries and different sizes and different revenues. And in some ways, realizing how different they are, how they work and how a lot of these processes are different. And in some ways, we're trying to standardize the stack, how we do things. So there have been some pain points on how we standardize more and more. But on the flip side, what's been really interesting is going back to what I said earlier in terms of business use cases. It doesn't really matter what industry, what size, how sophisticated a company really is. Ultimately, it's pretty simple to map back what they're trying to do to 3 broad areas, right, which is go to market, how customers are using the product, and how you optimize internal operations. In some ways, it's a lot of customization per industry, per company, but at the same time, all of this customization is around a few core areas and not as broad and wide as a lot of people make it out to be. I mean, obviously, you have outliers, businesses which are focused on extremely niche use cases. But for the most part, most companies really start with these takeaways.
That's a horribly hard question to answer by the way.
[00:49:02] Unknown:
That's why I ask it. And so for organizations who are struggling with their own internal capacity for building analytical capabilities, what are the cases where 5x might be the wrong choice, and they're better suited with just going down the path of hiring internal engineers or working with a fully vertically integrated solution or some other sort of consulting engagement?
[00:49:27] Unknown:
You know, I think for the most part, the modern data stack movement encompasses, you know, 80, 90% of use cases. But there might be niche use cases, especially, you know, if you are trying to do something which is super real time, you know, using something like Kafka and downstream processing might be a lot more relevant for you. Or, you know, using Spark or Hadoop for some of the use cases it's extremely good at, like finding a needle in a haystack, which are, you know, amazing use cases for something like Spark, then, you know, obviously, going down that architecture is probably gonna be more valuable for you.
I think there's, like, 10, 15% of pure tech companies which are always gonna wanna hire their own teams in house and work in their own box, and then, obviously, a solution like us won't make sense. But I think for the vast majority, for, like, 80% of companies, a solution like us could be really interesting.
[00:50:28] Unknown:
As you continue to grow and scale the work that you're doing at 5x, what are some of the things you have planned for the near to medium term?
[00:50:41] Unknown:
I think our biggest focus right now is being able to offer our platform completely self-service. You know, I think that's gonna be the biggest thing. And then, ultimately, being able to have more of an ecosystem where we can integrate with more and more vendors who can come play on our platform. We integrate with them, and we can provide a sales channel for them and ultimately provide more value for businesses. And I think on the engineering side, we've really been focused on India so far. We are in the process of going global. We're about to open South America. So the idea is that you'll be able to hire 2 engineers, one in Mexico and one in India, and have around the clock support in terms of core data engineering stuff. So being able to really have more and more options in terms of the talent needed for your platform, that's coming up there. But I think the biggest, biggest, biggest focus right now is to launch a self-service version of our platform.
[00:51:35] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:50] Unknown:
I think the biggest gap is that the space is so fragmented, and this is not any kind of pitch. This has got nothing to do with 5x. You know, I think the big players, Amazon and Google, are not really competitive when it comes to holistic data stacks. So there isn't a really good option to have an end to end offering if you wanted one. Right? There are a few companies which have small bits and pieces of it. But especially when you look at how quickly the space is growing and how many new categories have been added, it's honestly been a little bit surprising to think that there isn't something which brings it all together. So I think that's a huge gap.
[00:52:29] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at 5x and your perspective on the modern data ecosystem and some of the ways that it is able to help accelerate time to value for different organizations. So I appreciate all the time and energy you're putting into that space, and I hope you enjoy the rest of your day.
Thanks, Tobias. Thank you so much for having me on the show, and hopefully, we were able to add some value to your audience.
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introduction: Tarush Aggarwal
Career Journey and Early Experiences
Founding 5x and Modern Data Stack Adoption
Defining the Modern Data Stack
Market Shifts and Trends in Data Management
Challenges and Solutions for Data Integration
Security and Compliance in Data Management
White Spaces in the Modern Data Stack
Educating Customers on Data Best Practices
Core Technologies and Vendor Relationships
Skills and Training for Data Engineers
Interesting Use Cases and Customer Success Stories
Lessons Learned and Future Plans for 5x
Biggest Gaps in Data Management Tools
Closing Remarks and Contact Information