Data Management Trends From An Investor Perspective - Episode 136

Summary

The landscape of data management and processing is rapidly changing and evolving. There are certain foundational elements that have remained steady, but as the industry matures new trends emerge and gain prominence. In this episode Astasia Myers of Redpoint Ventures shares her perspective as an investor on which categories she is paying particular attention to for the near to medium term. She discusses the work being done to address challenges in the areas of data quality, observability, discovery, and streaming. This is a useful conversation to gain a macro perspective on where businesses are looking to improve their capabilities to work with data.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Machine learning is finding its way into every aspect of software engineering, making understanding it critical to future success. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype.

Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar to get you up and running in no time. With simple pricing, fast networking, S3 compatible object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
  • Your host is Tobias Macey and today I’m interviewing Astasia Myers about the trends in the data industry that she sees as an investor at Redpoint Ventures

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of Redpoint Ventures and your role there?
  • From an investor perspective, what is most appealing about the category of data-oriented businesses?
  • What are the main sources of information that you rely on to keep up to date with what is happening in the data industry?
    • What is your personal heuristic for determining the relevance of any given piece of information to decide whether it is worthy of further investigation?
  • As someone who works closely with a variety of companies across different industry verticals and different areas of focus, what are some of the common trends that you have identified in the data ecosystem?
  • In your article that covers the trends you are keeping an eye on for 2020 you call out 4 in particular, data quality, data catalogs, observability of what influences critical business indicators, and streaming data. Taking those in turn:
    • What are the driving factors that influence data quality, and what elements of that problem space are being addressed by the companies you are watching?
      • What are the unsolved areas that you see as being viable for newcomers?
    • What are the challenges faced by businesses in establishing and maintaining data catalogs?
      • What approaches are being taken by the companies who are trying to solve this problem?
        • What shortcomings do you see in the available products?
    • For gaining visibility into the forces that impact the key performance indicators (KPI) of businesses, what is lacking in the current approaches?
      • What additional information needs to be tracked to provide the needed context for making informed decisions about what actions to take to improve KPIs?
      • What challenges do businesses in this observability space face to provide useful access and analysis to this collected data?
    • Streaming is an area that has been growing rapidly over the past few years, with many open source and commercial options. What are the major business opportunities that you see to make streaming more accessible and effective?
      • What are the main factors that you see as driving this growth in the need for access to streaming data?
  • With your focus on these trends, how does that influence your investment decisions and where you spend your time?
  • What are the unaddressed markets or product categories that you see which would be lucrative for new businesses?
  • In most areas of technology now there is a mix of open source and commercial solutions to any given problem, with varying levels of maturity and polish between them. What are your views on the balance of this relationship in the data ecosystem?
    • For data in particular, there is a strong potential for vendor lock-in which can cause potential customers to avoid adoption of commercial solutions. What has been your experience in that regard with the companies that you work with?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What advice do you wish you'd received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly Media on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar to get you up and running in no time. With simple pricing, fast networking, S3-compatible object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode (that's L-I-N-O-D-E) today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape, and Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a machine learning expert who provides unlimited one-to-one mentorship support throughout the program via video conferences. You'll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don't have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there's no obligation. Go to dataengineeringpodcast.com/springboard today and apply. Make sure to use the code AISPRINGBOARD when you enroll. Your host is Tobias Macey, and today I'm interviewing Astasia Myers about the trends in the data industry that she sees as an investor at Redpoint Ventures. So Astasia, can you start by introducing yourself?
Astasia Myers
0:02:33
Hi, yeah. I'm Astasia Myers. I'm part of Redpoint Ventures' early stage team, focusing on enterprise. Thanks so much for having me today, Tobias.
Tobias Macey
0:02:42
Yeah. Do you remember how you first got involved in the area of data management or working with data companies?
Astasia Myers
0:02:47
Yeah, it's actually pretty cool. I started my career in sell-side equity research covering publicly traded enterprise companies. So I actually covered Seagate, WD, NetApp, EMC, all the big players. This was an exciting time for storage. It was the era when all-flash arrays were starting to come on the scene and software-defined storage was all the rage. You know, Pure Storage at that time was still private. And digging in as an equity researcher really got me familiar with the world of storage and data management. Interestingly, I then transitioned to Cisco, where I was on the M&A and venture investing team supporting the core business units of servers and networking. We were spending a lot of time analyzing the storage market and did quite a few investments in that space that I led. I'm very proud of leading the Series C in Cohesity, which is now a unicorn in the backup and recovery space. We also invested in Datos IO, which Rubrik acquired, in Springpath, which Cisco bought, and in Elastifile, which Google more recently acquired. Since my time at Cisco, I transitioned to Redpoint's early stage team, and I continue to look at data and ML focused startups, everything from new databases to ETL to ML tooling. I also share a lot of the research on the subject on my Medium blog and Twitter account for others to learn more about the category.
Tobias Macey
0:04:12
And before we get too much further into more of your background and the ways that you keep up to date with the industry, can you give a bit of an overview of what Redpoint Ventures works on and your role there?
Astasia Myers
0:04:22
Yeah, of course. So Redpoint is a Silicon Valley based VC firm that's been around for about 20 years. We currently have two funds, a venture fund and an early growth fund. I sit on the venture team. It's a $400 million vintage, and we are quite enterprise leaning, so B2B investments represent about 80% of deployed capital, and we invest in seed, Series A, and Series B out of that fund. We've had a long history of investing in data companies. We've deployed over 250 million in capital into those businesses over the last few years, and we have been honored to partner with startups like Snowflake, Looker, CockroachDB, Dgraph, Pure Storage, Cyral, Dremio, and Springpath. So we've been doing it for quite a few years, and continue to think there are opportunities in the space. It's definitely not slowing down. My role on the team is enterprise-focused investment professional. I work with a handful of enterprise businesses today. Three are in the data space: one we've publicly announced, called Cyral, which is in data layer access control and visibility, and then two others in the data infrastructure world. So I love all things data and am looking to speak with startups in that space.
Tobias Macey
0:05:36
From your perspective, as an investor and somebody who's working with these companies, what is it about the overall category of data oriented businesses that you find appealing or attractive?
Astasia Myers
0:05:46
Yeah, there's a few different factors. You know, one thing that we like is these markets are absolutely enormous. So according to IDC, the big data and business analytics market is around 190 billion in annual spend and continuing to grow at a double digit clip, and should be close to 275 billion in just two years. So absolutely enormous. To put that in perspective, IDC for the same year says information security spend is only 120 billion. So big data is huge, you know, 65 billion larger in terms of scale. And it's not just the overall market. If you look at individual subcategories, you know, you have BI and analytics, it's about 20 billion; database management, that's 50; data warehouses, that's about 20 billion. Just huge categories that we rarely see in enterprise. The second thing is, because they're big categories, there's a precedent of large outcomes in all the subcategories. So if you look at BI, you have Tableau and Looker and Qlik and ThoughtSpot and Domo. Databases, of course: Oracle, SAP, and we've seen Mongo and Cloudera. Data management: TIBCO, Informatica. Log management: Elasticsearch, Splunk. So it's awesome. As an investor, you can point to businesses that have had really successful exits and become enduring companies, which is what we look for. There's a few other things that we like in terms of the market dynamics. These subcategories are large, but there's also a winner-take-a-lot or winner-take-most dynamic, which is excellent. So when I was covering publicly traded companies like EMC, they had the largest market share, at around 35%, which is pretty impressive for any category. So it's really an oligopoly style of market here. And then the tech, you know, it's really hard to build this type of technology. Differentiation matters and is felt by the user, and this technical differentiation can be a defensible moat for the business over time. And because of that, finally, you know, these products are really sticky. It's core infrastructure. Once the technology is adopted, there's a moat, since there's friction to rip it out. And this means long contracts that are potentially larger, all things we like to see.
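As a quick check on the "double digit clip" mentioned above, the growth rate implied by the quoted market sizes can be worked out directly. This is a minimal sketch that takes the figures exactly as spoken in the interview (about 190 billion growing to about 275 billion in two years); the function name `implied_annual_growth` is made up for illustration.

```python
def implied_annual_growth(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1

# Figures as quoted in the interview, in billions of dollars (not independently verified).
rate = implied_annual_growth(190, 275, 2)
print(f"Implied annual growth: {rate:.1%}")
```

Under those quoted numbers the implied rate is roughly 20% per year, which is consistent with the description of double digit growth.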
Tobias Macey
0:08:03
And in terms of the information that you rely on for being able to keep up to date on what's happening in the data industry and understand what's relevant, and what businesses are going to perform well, given the overall economy and the overall environment that they're working within and building on, what are those types of information that you look at? And how do you gain that understanding?
Astasia Myers
0:08:26
Yeah, so as you can imagine, there's really no one source we go to. We take a mosaic approach to our research. So you know, everything from podcasts like Data Engineering (I've been a longtime listener, and I'm so honored to be here today) but also newsletters like O'Reilly and Data Council, or even from the public cloud vendors that talk about their new product releases. Social media is a great outlet, so Reddit and Twitter. Before, we'd be going to events like Strata and meetups around open source technologies. And you know, my personal favorite, because I came from academia and sell-side research, is speaking to operators and buyers. These calls are often the ones that I have the most fun in during my day. They're just my favorite because, you know, it cuts out the noise and the hype. It goes straight to the people who are working with these technologies and thinking about their architectures. And you know, we actively engage operators and have a network. For anyone who's listening who operates in the data space, please email me and we'll be adding you to our community events and engagement. We think that the insights from the operators, the people on the ground, are fundamental to how we get information and make smart decisions.
Tobias Macey
0:09:36
And then within those different information sources, it can often be difficult to pull out the useful signal from the noise because of the variety of ways that it's being represented and potential biases in terms of how things are presented. So what is your personal heuristic for determining the relevance of any given piece of information and factoring that into your overall decision as to whether or not a given company or category is worthy of investment or further investigation?
Astasia Myers
0:10:03
Yeah, it's funny, we're on a show that talks about data, and data is information. And like everyone, we're inundated with information all the time. You know, we're really looking for a needle in the haystack. Any tidbit of information that we read can actually change how we're thinking about the world or what we're doing in a single day, right? If I come across something fascinating, I'll spend a few hours trying to investigate further. So you're totally right. You know, I look for indicators and information that suggest three things: novelty, game changing, and this is the future. Novelty is pretty clear. Is this new, original, or unusual? Have I heard about this before? We see so much information every day that just being novel can be a big deal. Game changing: is there sentiment around the information? Is this considered groundbreaking? Does this change someone's process or daily activities or their role? And then the future: we kind of think about, is this something that most people will adopt, implement, and that will kind of become ubiquitous in a few years? And so if the information that I'm reading suggests those three things, it's pretty exciting for us and we dig in further.
Tobias Macey
0:11:10
And so as somebody who works closely with a variety of different companies across different verticals and areas of focus or problem domains, what are some of the common trends that you have identified in the overall data ecosystem that you're keeping an eye on and that you're spending your focus on in terms of identifying potential up and comers that are worthy of investment?
Astasia Myers
0:11:29
We're seeing three trends today. One is flexibility around delivery models of the technology. Two is that data security matters more now than ever before, and this could be data security solutions or data security in products. Three is increased adoption of open source. So in terms of the flexible delivery models, you know, previously everything was on premise. Now it can be in private cloud, public cloud, or a system that sits across environments. Additionally, a lot of the solutions can be a fully hosted SaaS, which we all know about. But we're starting to see the emergence of what we call cloud-prem. The idea behind cloud-prem, which is sometimes called VPC, for virtual private cloud, is that it's a new architecture that splits the SaaS application into code and data. What's interesting about this is that the SaaS company writes, updates, and maintains the code, and the customer manages the data. So an example of this would be that the control plane is in the cloud, but the data operators are in the private or public cloud VPC. And this is one of the biggest changes we've seen recently. Customers are adopting this approach for a few different reasons. The first is cost. The cloud is more expensive for large enterprises, so many are considering moving back to their own infrastructure and using cloud only to manage bursts. This is kind of what we saw originally, almost 10 years ago. And so this architecture could be more cost effective for these buyers. The second is control over data and access management. You know, this is often the crown jewels of the business. And the third is compliance. We're seeing more regulation than ever before with CCPA and GDPR. A lot of businesses want to be better prepared and to structure their data, and how they engage with software applications, more centrally for these reasons.
And what's interesting about this new cloud-prem approach is that it really changes the customer's control of the data, and therefore the power dynamics with the vendor. When the customer controls the data, they can reshape it, move it, share it with a competing vendor, and the customer can create their own applications on top of the data. And the integration between applications no longer has to be at the application layer with Salesforce or Marketo APIs; it can now be done at the database. So it's really interesting, the potential unlocked with this new delivery model and the flexibility that buyers have now for deciding what's best for them. The second area, as I mentioned earlier, is really data security. You can see this in the cloud-prem model. People want to control the data themselves for compliance. SOC 2 is becoming a requirement for very early stage startups to sell into the enterprise. It is becoming the standard, and simply no one wants to be in the news because of a data breach. You know, they've been very high profile, from Capital One to Marriott. No one wants that. So data security is key in these platforms. And then the third is increased adoption of open source. You know, open source has existed in the data layers for a long time, with Postgres and MySQL, but now we're seeing it touch nearly every type of database, from graph to analytical. We see it moving up the stack to data catalogs, processing engines, ETL, and data quality. While you don't need to deliver data software as open source (Snowflake, which we're invested in, is a clear example of this), there is movement in the space towards open source technology. People want to see the code, check out the tech for themselves, see if it works in their environment, and adopt it if it does without dealing with a sales team. Open source allows them to do that. It gives them the flexibility to be less reliant on the vendor, potentially decrease costs, and increase control.
Tobias Macey
0:15:21
Going back to your point about the private SaaS model of deploying the actual service into the customer's environment, there have been a few different people that I've spoken with as well that are taking advantage of that, and it's definitely an interesting delivery model. Notable examples are Datacoral and Snowplow, and there's also another company called ChaosSearch, all of which allow you to store the data in S3 or wherever your own applications are. They'll deploy the software to your environment and manage it, but you still, as you said, have control over the overall lifecycle of the information, and you don't have to worry about handing it off to a third party and then requesting access to get it back, as with things like Google Analytics. And so for data in particular, where it does have that gravity and it does have the increased cost of moving it to and from different environments, then as you said, it's a cost savings as well as a control element.
Astasia Myers
0:16:14
Totally. It's really exciting to see this new architecture. We actually started seeing this emerge about two years ago, not in the data space. Businesses that exemplify this are Mattermost, which is an open source alternative to Slack, and Bitwarden, which is an open source password manager. They have these models. We have an entire thesis around investing in companies in this space, and both of the two that I just mentioned are in our portfolio. We think it's really exciting. You know, with increased regulation and higher costs for the public cloud, this gives buyers an opportunity to safely adopt technology that meets their regulatory requirements.
Tobias Macey
0:16:55
And so you also wrote an article recently that was highlighting the four main trends that you're keeping an eye on for 2020 in the data space, and those call out in particular the elements of data quality, data catalogs, observability of the influences on critical business indicators or KPIs, and streaming data. So taking those in turn, starting with the data quality aspect, what are some of the driving factors that influence that quality? And what elements of that problem space are being addressed by the companies that you're watching?
Astasia Myers
0:17:28
Yeah, so to make sure everyone's on the same page, data quality management ensures that data is fit for consumption and meets the needs of data consumers. To be high quality, data must be consistent and unambiguous. You can measure data quality through a few different dimensions: accuracy, completeness, integrity, and validity, among others. And there really isn't one factor that causes data quality issues. Factors that influence data quality include data capture, integration, transfer, and management. Capturing the wrong data or excessively collecting it can lead to shortcuts for reporting, so there could be bad data quality. Data quality issues are often the result of database merges or systems and cloud integration processes in which data fields that should be compatible are not, due to schema or format inconsistencies or unclear field definitions. You know, at the basic level, manual steps of data entry and manipulation can cause problems. And finally, there's a fragmentation of information systems, leading to bad migrations or data duplication, so that data can become stale and out of date, which is also a form of bad data quality. And then fundamentally, you know, data can be corrupted, or there can be changes in source systems that can lead to bad results. The companies that we're tracking can be categorized across two vectors. One is internally versus externally generated data, and two is in-motion or at-rest data quality evaluation. So with regards to the first vector, we can think about third party data that businesses in verticals like finance, real estate, and healthcare adopt. They're ingesting this third party data to inform their systems, and often the data has not been cleaned and prepped by the vendor. And so they need to adopt technology to make sure the data fits well.
And this is very different than internally generated data, like customer data that we'd get in tech and e-commerce and CPG, where they need to look at the data their systems organically generate. And then the second vector is around evaluating data quality for data in motion versus at rest. So in motion means tying into Kafka streams or Pulsar to augment bad data in real time, before it reaches the sink. For data at rest, there are solutions out there that can scan databases for null values, inconsistent formatting, or changes in the distribution of data, so you can make sure that the current data you're seeing at rest mirrors historical distributions and there aren't any issues.
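The at-rest checks described above (null values, inconsistent formatting, and shifts in distribution) can be sketched in a few lines. This is only an illustrative outline using the Python standard library, not how any of the vendors discussed implement it; the function names, the date pattern, and the sample values are all assumptions made for the example.

```python
import re
from statistics import mean, pstdev

def null_rate(values):
    """Fraction of entries that are None or empty strings."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v == "")
    return missing / len(values)

def format_violations(values, pattern):
    """Non-null values that do not fully match the expected regex."""
    compiled = re.compile(pattern)
    return [v for v in values
            if v not in (None, "") and not compiled.fullmatch(str(v))]

def mean_shift(historical, current):
    """How many historical standard deviations the current mean has moved."""
    spread = pstdev(historical)
    gap = abs(mean(current) - mean(historical))
    if spread == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / spread

# Hypothetical column samples pulled from a table at rest.
signup_dates = ["2020-01-05", "01/05/2020", None, "2020-02-11"]
historical_order_totals = [100, 102, 98, 101, 99]
current_order_totals = [130, 128, 131]

print(null_rate(signup_dates))
print(format_violations(signup_dates, r"\d{4}-\d{2}-\d{2}"))
print(mean_shift(historical_order_totals, current_order_totals) > 3)
```

A real system would compare full distributions rather than just means, and would run these checks on a schedule against each monitored table, but the shape of the problem is the same: compute a profile of the current data and compare it against expectations or history.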
Tobias Macey
0:20:05
And so in terms of the overall data quality landscape, what are some of the unsolved areas that you see as being viable options for newcomers or new businesses to try and tackle and that businesses would be able to gain value from and are actively looking for?
Astasia Myers
0:20:22
Yeah, what excites me about data quality is that it's foundational to businesses' human and machine decision making. You know, dirty data can result in incorrect values in dashboards and executive briefings. It's kind of crazy. We've heard about bad data leading to product development decisions that can cost corporations millions of dollars in engineering effort. And then, you know, with machine-made decisions based on bad data, it could lead to bias or incorrect actions that could create a bad user experience. We've come across a few startups and open source projects operating in this space: Soda, Toro, Monte Carlo, Great Expectations, dbt, and Nexus data. It's kind of the Wild West in terms of data quality. It's an idea that's been top of mind for senior leaders for the past few years, but there really haven't been great tools out there to solve it. Some teams have built systems internally to identify data quality issues, but there hasn't been a platform that's emerged just yet. Most of the startups I mentioned have only been around for two to three years or are early in their journey. So we think there's a lot of opportunity in the space, as this is a top priority for senior leaders.
Tobias Macey
0:21:35
And another element that ties into the overall data quality question is the idea of discoverability of the information and being able to track its origin and its lineage to ensure that the processes that are being run on it aren't eliding important information or introducing inaccuracies or other bias. And that is a big portion of what's covered in the overall concept of data catalogs or metadata management. And I'm wondering what you're seeing as being the main challenges that businesses face in establishing and maintaining those data catalogs and being able to have robust mechanisms for managing all that metadata.
Astasia Myers
0:22:14
Yeah, data catalogs are super interesting because they capture rich information about data, including the application context, behavior, and change in the lineage. As you noted, it's pretty neat technology because it supports self-serve data access, empowering individuals and teams so that they don't actually have to work with IT to receive data, and they can discover for themselves what's relevant. This actually helps improve the productivity of ML and data science teams. The other thing we like is, as you noted, they can address PII. They can discover it, and so you can put controls on who can access PII data. One of the challenges faced by businesses establishing data catalogs is the implementation. As you can imagine, there's fragmentation of data across different silos, databases, storage layers, sometimes in Excel. There are many resources you need to tie into, so it can be hard to implement a solution. And the second is really around user education and adoption. When we talk to buyers, people often say that, theoretically, they understand the value of a data catalog, because the team no longer needs to work with IT, which can be a bottleneck for data access, and they can actually get fresher data by having a self-serve model. But often we hear that these individuals have to experience a data catalog themselves to fully appreciate the value. And I think that's why we're starting to see now the emergence of many different players. It's taken a few years to frame the value and gain visibility, and now teams are starting to adopt it.
Tobias Macey
0:23:53
And there are, as you mentioned, a few different entrants to this market, some of which are fairly well established. The one that comes to mind most readily is Alation. But there are also a number of open source options; the ones that come to mind are Amundsen from Lyft and the DataHub project from LinkedIn. And I'm wondering, in terms of the available options that do exist, what you see as the overall shortcomings in those products that might inhibit their adoption or make them complex to implement, and what are some of the overall approaches being taken by some of the other companies that you're keeping an eye on who are trying to solve this problem?
Astasia Myers
0:24:28
Yeah, you're right. It's interesting to see that Alation and Collibra have been around for a while; they're closed source, enterprise-oriented products. And there's been this new emergence of open source projects from Lyft, LinkedIn, and Netflix, and then other large businesses like Airbnb and Uber have built their own and publicly talked about it but not open sourced it just yet. The way we look at the different types of technologies is, first, closed source versus open source. We are agnostic to that approach. But then also: what data sources do they ingest from, the stack they use, and the functionality that they support? There's a broad range of functionality, everything from looking at sample rows, data profiling, freshness metrics, ownership, top users and queries, and lineage, in addition to the fundamentals of understanding schemas and metadata. And then in terms of stacks, you have Amundsen, which is Python and Node and uses databases like Neo4j and Elasticsearch, while Metacat is Java and Elasticsearch based. And then finally, in terms of data sources, this is how it could get implemented in an environment. For those that are using Airflow for DAG orchestration, Amundsen has a Python library to integrate at that point. Other solutions like LinkedIn's DataHub tie directly into Presto, MySQL, and Oracle via API calls or Kafka events. So it really depends on a few factors, as I noted: the breadth of the functionality that you're hoping to get from your data catalog, the stack that you are familiar with and comfortable adopting, and finally, what your data sources are and your perspective on how best to integrate a data catalog. If you're not using Airflow, Amundsen may not be the best choice for you. A lot of businesses are using Airflow, so it's a great option. It just really depends on your local environment. And I think that's why we see so many different offerings in the space today.
Tobias Macey
0:26:30
Another piece that isn't specifically a data catalog, but that I was impressed by when I spoke with the folks behind it a while ago, is the Marquez project out of WeWork, as a means of having useful integration points for automatically populating the metadata information and being able to visualize the overall lineage of the processes that produce the end results.
Astasia Myers
0:26:52
Yeah, it's a really interesting project that came out of WeWork. I think, foundationally, for all of these data catalogs, it needs to be a seamless implementation into the environment. As I said, based on the data sources or orchestration layer that you're using, all the data should be pre-populated, it should have freshness, and it should be tied to a data steward, which could be auto-generated based on who is collecting the data or by tying into your LDAP systems, so you know who should be accessing the data. One of the things that I think is most powerful is the search functionality in some of these platforms: you type in a keyword and they auto-discover tables based on relevancy, people within your network, and the owner, so you can have the freshest data available.
Tobias Macey
0:27:39
And moving to the overall idea of observability of KPIs and the different factors that influence them, and being able to dig into those different elements: what are the overall capabilities that are necessary in those types of systems, and what is lacking in any of the current approaches that businesses are using to track those KPIs and gain insight into what is making the different indicators move in different directions?
Astasia Myers
0:28:07
In terms of what's necessary, it's obviously access to the data itself. So we've seen examples where they tie into Kafka; others tie into the data warehouse. The next layer is making sure you have a metric that is defined consistently for the product. And the third aspect is the rules engine or machine learning that is applied to identify the appropriate bounds for this type of data, alert on whether the data is outside those bounds, and also help with root cause analysis if there are challenges. The way you phrased the question, you make it sound like there have been a number of incumbents offering KPI observability for a while. That's just simply not the case. Anodot is one of the most established vendors, and it was founded in 2014. Most of the companies that we're tracking were founded from 2018 to the present, and many of them are still building the solution. So it's hard to say that the current offerings are lacking. What's been great is that one of the challenges historically for this category was the integration piece, around where you tie into the data pipeline; data typically was fragmented for customers. Now there are actually clear design patterns that have emerged in the data pipeline. Five years ago, data warehouses weren't as mature. Now Snowflake, Redshift, and BigQuery are kind of standards. People usually have a data warehouse, people are moving to ELT versus ETL, and then you think about enriching the data in the warehouse. These solutions can tie themselves to the data warehouse versus multiple data sources, so it's easier to implement and extract value when data is aggregated in one location. The second challenge that we've seen with these products is the need to make it a self-serve model where customers can leverage pre-existing metric definitions, if they exist in something like LookML, and apply them to KPI observability, or easily define their own metrics.
So this can be a process; sometimes there's a cultural element to defining metrics in businesses that can make it a little harder. And then the third challenge is really just the accuracy of identifying these anomalies and providing insights into the root cause. The rules engines have to be tailored not just to the vertical that the company is operating in, but to the company itself in order to extract value. So it is a complicated solution to build while making sure that customers are extracting value, but we really do like the fact that, one, implementation has gotten easier, and two, that when we talk to buyers, this is incredibly top of mind. The value prop is clear. Today, people want to look at information and have dashboards. There are often hundreds of dashboards in large enterprises, and people aren't looking at them all the time. But they need to know if a KPI is out of whack, and these technologies are fundamental in allowing them to do so.
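As a rough illustration of the "bounds" idea described in this answer, the sketch below flags KPI values that fall outside a band derived from recent history. It is a minimal example under stated assumptions, not how any particular vendor works: the function name `find_anomalies`, the 7-point trailing window, the 3-standard-deviation threshold, and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(values, window=7, k=3.0):
    """Flag points outside mean +/- k * stdev of the trailing window.

    Returns a list of (index, value, lower_bound, upper_bound) tuples.
    The window size and k are illustrative defaults, not recommendations.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]      # the previous `window` points
        center = mean(history)
        spread = stdev(history)             # sample standard deviation
        lower, upper = center - k * spread, center + k * spread
        if not (lower <= values[i] <= upper):
            anomalies.append((i, values[i], lower, upper))
    return anomalies


# Hypothetical daily KPI readings with one obvious drop.
daily_signups = [100, 102, 98, 101, 99, 103, 100, 101, 60, 102]
alerts = find_anomalies(daily_signups)
for index, value, lower, upper in alerts:
    print(f"day {index}: {value} is outside [{lower:.1f}, {upper:.1f}]")
```

With this sample data only the drop to 60 on day 8 is flagged. A fixed statistical band like this is the simplest possible rule; the tailoring to the vertical and to the individual company that is mentioned above is exactly what such a naive rule lacks.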
Tobias Macey
0:31:23
And then in order for any of these solutions to be useful and effective, you need to be able to collect and track the information that actually feeds into what is causing some of those different indicators to move. I'm wondering what the challenges are in identifying what those data sources are and then being able to collect and associate them effectively. And then, once that data is collected, what are the challenges for businesses in this observability space in terms of being able to display and analyze the data so that it is easy to interpret for people who might not necessarily have all of the training to do their own analysis or understand what all that data means and how it factors into the overall top-level indicator that they're trying to understand?
Astasia Myers
0:32:13
Well, most businesses typically have an operational and an analytical database. So I would argue that the databases that you should be tying into are relatively clear in the customer environment, and it's just a matter of how you implement your solution. As I said, I think most of the solutions that we have come across tie into the data warehouse, like Snowflake, because that is where information is being aggregated on the analytical side. In terms of how a solution appropriately and clearly shows value, and the challenges behind that: one is that the volume of data that these databases are collecting is incredibly vast. And so when you are trying to do this analysis and run your rules engine or machine learning over it, you actually need to load some of this metadata into memory so you can run your analysis. So the movement of the metadata into their architectures has to be fast, and they have to be able to fit the volume of data into memory, so they have to be thoughtful about it. And then the second layer is the visualization of this. People are operating in their BI solution, and often are looking at it a few times a day, if not all day long. And with these KPI observability solutions, ideally you'd be tying into BI so that you're alerting in the current visualization layer of the product. So it is imaginable that the BI solutions of today will be offering this eventually. But if you're a third-party vendor, you need to make sure that your UI is beautiful and very clearly identifies what is healthy and what is not healthy, and then points to what could be the cause of the data change, in the present and over time.
Tobias Macey
0:33:58
And then the last point that you called out as a trend that you're keeping an eye on for this year is streaming, which is obviously an area that's been growing rapidly over the past few years, with a number of different open source and commercial options available, most notably Kafka, and then Pulsar as one of its close competitors. And then in terms of the corporate space, there's Databricks, which has focused on their streaming capabilities. There's Flink, which is being used for a lot of stream processing, and Pravega from the folks at EMC. I'm wondering what you see as being the major business opportunities for making the streaming capability more accessible, easier to implement, and more effective for businesses that are relying on it.
Astasia Myers
0:34:42
Yeah. So there are a few different ways we're thinking about improving streaming technologies. You're right, there's both the processing layer, like Flink, and then you have the more storage and pub/sub layer, like Kafka. Speaking about the Kafka pub/sub layer, that is the category where we're spending the most time right now. That's not to say we're not interested in processing, but we just haven't seen as many novel offerings in that space. There's definitely a call to action to anyone listening, if you're compelled by that category and have the background to do so. But in terms of the streaming platform world, we think about improvements across four different buckets: speed, volume, management, and cost. Regarding speed, everything is moving to real time, like dashboards and workflows, and if data can flow faster, actions and decisions can be faster. And when it comes to technologies, we've actually seen a few open source projects and other commercial offerings that can be 10x faster than Kafka in production, with only a slight overhead over what the hardware can do itself. In terms of volume, more data is being created faster than ever before. We've known this for decades now. It's hard to keep up with the data volume, and so new solutions need to be able to deal with high data volume and more topics. In terms of management, we've been told ZooKeeper, which is core to Kafka, is very hard to manage; often people staff someone to manage the Kafka cluster. I appreciate that the team is replacing this component, but we believe the user experience from a management perspective can be even better. And we've heard that maintenance can be challenging because the number of topics can grow quickly, so teams are constantly balancing and upgrading instances, which can be hard. And then finally, on cost, you can think about it through two different lenses. The first is the number of people you have to staff to keep the service up and running.
I've heard of teams that have three-plus people trying to manage their clusters, which can be very expensive given the going rate for a great engineer. And then the second is the services themselves. Pulsar is interesting: it has a two-tier architecture where serving and storage can be scaled separately, which can decrease costs. And this is also really important for use cases with potentially infinite data retention, like logging, where events can live forever. If you can move this to lower-cost environments like S3, as compared to high-performance disks, this can help with cost management as well. So it's really four things: speed, volume, management, and cost.
Tobias Macey
0:37:24
And on the cost aspect, Pulsar and soon Kafka have the option of the tiered storage capability, where you can keep the most recent data on fast disk for access to recent topics, and then have older data automatically lifecycle off into S3 while still being accessible using the same API if you need to be able to run processing against historical information.
Astasia Myers
0:37:48
Yeah, exactly. I think that was one of the biggest improvements between Pulsar and Kafka, and it's great to see that Kafka will also be introducing this. We've heard that this is incredibly useful. It really cuts down the cost and supports this long-term storage, which is great in certain regulated industries.
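The tiered storage idea discussed in this exchange can be sketched in a few lines. This is a toy model only, not the Pulsar or Kafka API: the `TieredLog` class, its `hot_capacity` parameter, and the use of in-memory dictionaries to stand in for fast disk and object storage are all assumptions made for illustration.

```python
class TieredLog:
    """Toy append-only log with a small 'hot' tier and an unbounded 'cold' tier."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = {}    # offset -> record; stands in for fast local disk
        self.cold = {}   # offset -> record; stands in for object storage such as S3
        self.next_offset = 0

    def append(self, record):
        """Write a record to the hot tier and return its offset."""
        offset = self.next_offset
        self.hot[offset] = record
        self.next_offset += 1
        self._offload()
        return offset

    def _offload(self):
        # Move the oldest records to the cold tier once the hot tier is over capacity.
        while len(self.hot) > self.hot_capacity:
            oldest = min(self.hot)
            self.cold[oldest] = self.hot.pop(oldest)

    def read(self, offset):
        """Same call for recent and historical data; the tier is an internal detail."""
        if offset in self.hot:
            return self.hot[offset]
        return self.cold[offset]


log = TieredLog(hot_capacity=2)
for event in ["a", "b", "c", "d"]:
    log.append(event)

print(sorted(log.hot), sorted(log.cold))
print([log.read(offset) for offset in range(4)])
```

The point of the sketch is the single `read` method: a consumer replaying historical events does not need to know which tier holds them, which is the "same API" property mentioned above.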
Tobias Macey
0:38:09
And then in terms of the factors that are driving this overall growth in the need for access to streaming data and real time information, what are some of those driving elements, whether from the business landscape or the technical landscape that are pushing companies to try to adopt these capabilities?
Astasia Myers
0:38:27
It's interesting, I think it's a little bit of the consumer world flowing into the enterprise world. Consumers have short attention spans, want data immediately, and want insights and answers as soon as possible. And we're starting to see this in the enterprise as well. Dashboards are moving to being real time, refreshing at 10 minutes or less. Answers need to be as quick as possible, and back-end processes are all automated, so the fresher the data, the better. We believe that is a huge catalyst because everything is moving to real-time feedback. Streaming apps produce and rely on a constant flow of this data. Common examples include predictive maintenance, fraud detection, recommendation engines, and IoT, so all of this is increasing in terms of the volume but also the frequency at which it is being collected. Data science teams typically use streaming data, rather than batch, to provide rapid insights. Similarly, AI and machine learning models leverage streaming data to constantly train and infer. In short, these three things make using streaming data more popular across the board. It's pretty incredible. If you look at the market, it's growing significantly, from around 690 million in 2018 to close to 2 billion in the next few years, at a 22% CAGR over the period, which is really fast for most enterprise segments. I think this is actually an underestimate of how big the market is. I think we'll slowly see the degradation of batch, and most things will go to streaming unless it's exorbitantly expensive. But as we can see, a lot of these new projects are open source and cost effective.
Tobias Macey
0:40:12
Yeah. And the interesting thing about the stream/batch dichotomy is that a lot of the major proponents of stream processing and streaming data attest that batch is just a special case of streaming, and that you can and should just handle everything in the streaming context.
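As a quick arithmetic check on the market figures quoted earlier in this exchange (roughly 690 million in 2018, growing at a 22% compound annual growth rate toward about 2 billion), the sketch below applies the standard compound growth formula. The helper names `project` and `years_to_reach` are hypothetical, and the inputs are only the rounded numbers mentioned in the conversation.

```python
import math


def project(value, rate, years):
    """Value after compounding at `rate` per year for `years` years."""
    return value * (1 + rate) ** years


def years_to_reach(start, target, rate):
    """Years of compounding at `rate` needed to grow from `start` to `target`."""
    return math.log(target / start) / math.log(1 + rate)


start_millions = 690     # quoted 2018 market size, in millions
cagr = 0.22              # quoted compound annual growth rate

years_needed = years_to_reach(start_millions, 2000, cagr)
print(f"Years to reach 2 billion: {years_needed:.1f}")
print(f"Size after 5 years: {project(start_millions, cagr, 5):.0f} million")
```

Under these rounded inputs, reaching 2 billion takes a little over five years of compounding, which is consistent with the "next few years" framing in the conversation.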
Astasia Myers
0:40:28
Yeah, it's really interesting to see that. It's very smart positioning. When you think about some of the batch processing engines, and I'm thinking about Spark's stream processing layer, it's actually micro-batches. So they take the reverse opinion of that. But yes, it is very smart positioning to say that batch is a subcomponent of streaming. I personally think that there's value in both systems, but there's going to be a significant migration to streaming over time. Once again, I think that's the consumer appetite percolating into enterprise businesses and wanting data and answers faster than ever before.
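The two viewpoints in this exchange, micro-batching a stream versus treating a batch as a bounded stream, can be illustrated with plain Python iterators. This is a conceptual sketch only and does not use Spark or any streaming framework: `micro_batches` and `running_total` are hypothetical helpers, and the batches here are cut by event count, whereas real micro-batch engines typically cut them on a time-based trigger.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an iterable of events into lists of at most `batch_size` events."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_total(stream):
    """Streaming-style processing: emit an updated result after every event."""
    total = 0
    for event in stream:
        total += event
        yield total


events = [3, 1, 4, 1, 5, 9, 2]

# Micro-batch view: the stream is chopped into small batches, each processed as a unit.
batches = list(micro_batches(events, 3))
batch_totals = [sum(batch) for batch in batches]

# Streaming view: a finite list is just a stream that happens to end.
stream_totals = list(running_total(events))

print(batches)
print(batch_totals, stream_totals)
```

Both approaches arrive at the same final sum; the difference is how often an intermediate answer is available, which is the latency trade-off being discussed.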
Tobias Macey
0:41:13
And for businesses who are trying to adopt streaming, what are some of the barriers to entry that you're seeing and some of the missteps or mistakes that you see being made that could easily be addressed by having a vendor that works to paper over those challenges?
Astasia Myers
0:41:30
I think the main challenge that we see is the fact that there is a deficit of great data engineers and data platform teams, and often these businesses can't access those wonderful individuals. There's a concentration of talent at tech companies, and especially on both coasts, as compared to being broadly distributed across North America and Europe. And so sometimes they just don't have the data teams in place that they need to be able to adopt these technologies. A traditional DBA is quite different from what a current data engineering professional is responsible for. And so vendors that provide hosted services or customer support, that was a driver early on, and a lot of these businesses really helped get customers up and running with these systems. The second thing, which I think is lighter because I think most people appreciate the value of streaming, is identifying a use case where they can see a clear ROI and the cost is not exorbitantly expensive. And so finding those use cases for all businesses can be challenging at times. I think there are enough proof points in FinTech around trading and fraud detection, and in traditional enterprise around product and user analytics, but it is still early days. I think there'll be more use cases that are unlocked and that we discover in the future.
Tobias Macey
0:43:00
With your focus on these four major trends, how does that influence your overall investment of time and attention and where you make decisions as to where to actually put capital into play?
Astasia Myers
0:43:12
It's interesting, investors sit on a spectrum between being opportunistic and thematic. As you can kind of tell, I do a lot of first-person research, and I like to publish that out to the community to share those insights on what we're hearing. So I lean thematic, because I think it's helpful to deeply know the landscape and the technical differentiation in order to make literally data-driven decisions about where we invest our capital and who we partner with. That's also really helpful when we have seen a lot of different vendors and startups and have a deep understanding of the category: when an entrepreneur or founder comes to us with a piece of technology, we can truly appreciate the challenge in building that and the value of what they have done. So I'm super excited about data and ML focused startups. Overall, we believe that the categories are massive and foundational and can result in large exits. Since I'm more thematic, the four data themes that I discussed, data quality, data catalogs, KPI observability, and streaming, are particular areas we've been digging into further based on my research, speaking with operators, and our hypothesis of where the world is moving to. So when I do research and I share out my themes, those are particular areas where I'd love to talk to founders and find a partnership opportunity.
Tobias Macey
0:44:38
And then outside of those particular areas of focus, what are some of the other unaddressed markets or product categories that you see which would be lucrative for new businesses, particularly in the data space?
Astasia Myers
0:44:50
So it's interesting, we think about data infrastructure in terms of a Maslow's hierarchy of needs. Foundational, at the base of the pyramid, you have data warehouses, and beyond that you have ETL, and then you have BI. And the newer technologies we discussed today, like data quality, are more at the top of the pyramid. Fast movers and early adopters are considering these technologies today, but we believe the industry is moving in this direction, and these pieces will eventually become a crucial component of the future data stack. That's why we're evaluating them. Beyond the topics we talked about, I think there continues to be an opportunity to improve data access and usage for non-technical users. I appreciate this show is really targeted to people that are technologists, engineers, data platform leads, or PMs in this space, but we're excited about new technologies that help the everyday business user access, munge, and leverage data. I think great examples of that are Airtable, which completely changed how people think about spreadsheets, and Alteryx for data munging and cleaning. And then finally, solutions that allow people to adopt ML in their workflows, like forecasting, inventory management, and financial projections. This is really neat. Previously, all of that was confined to data scientists and machine learning engineers that have such technical depth. With the commoditization of machine learning algorithms and new platforms that are making it easily accessible to the everyday business user, it's incredible to see how in the future business-oriented employees will be able to access, clean, and create models themselves, creating huge business impact. So we're really excited about all data tools that can facilitate the work of non-technical users.
Tobias Macey
0:46:54
And so in most areas of technology and in data in particular these days, there is a strong mix of open source and commercial solutions that are available for solving any given problem with varying levels of maturity and polish between them where a lot of the commercial options might just be an optimization of an open source platform that adds in some ease of use capabilities or additional security measures. And I'm wondering what your views are on the overall balance of this relationship in the data ecosystem.
Astasia Myers
0:47:25
I'm so happy you asked this question because it's one of the questions that we get asked the most as investors by founders operating in the data and ML space. They always ask: Do I have to be open source? Should it be closed source? How do I demonstrate the value in a commercial offering when I have an open source project? Overall, we wholeheartedly believe that a great solution can be open or closed source. You have great open source projects like Kafka, Spark, Elastic, and CockroachDB, which are wonderful examples of venture-backed startups supporting an open source project and then adding commercial value on top through security, compliance, and integrations that help with customer adoption. But there are also examples of closed source offerings that are absolutely killing it, like Snowflake and Dremio. So there really is no one right answer. What we see is that the mix of open source to closed source also depends on where in the stack you are operating. In core infrastructure, like databases, we're seeing even more of a movement to open source. Higher in the stack, like BI, it has traditionally been closed source, but people are starting to adopt open source options like Superset if they're a more technical user. What we find is that the closer the solution is to the business person, the less likely the solution is going to be open source, because these people are unlikely to know how to get it up and running, so a fully packaged solution is a better fit for them.
What we often find is that open source is a huge form of customer acquisition to generate pipeline: technologists can pick up the solution, check it out themselves, and see if it works, and then the company that supports the project can call on the user and try to convert them into a paying customer with either support, an enterprise instance, or a fully hosted offering. This same dynamic on the go-to-market side that open source creates can be created by closed source technology, through trials with self-serve models or sandbox environments. So it really can be approached with both form factors. I would think about two things: one, the technical capability of your buyer, and two, how low in the stack you are operating, because there's more of a precedent lower in the stack for open source than higher.
Tobias Macey
0:49:48
And as you mentioned, the closer solutions get to the business user, the more likely they are to be commercial. But another element of that is that the lower-down elements in the stack that are increasingly open source, in terms of databases or streaming platforms, are also the elements that are closest to the core data and how it's being stored and represented, which is where a lot of the potential lock-in occurs. So business intelligence platforms, for instance, are fairly easy to swap out because you just need to connect them to the data source. But if your data is owned by a proprietary solution and it's stored in a proprietary format, it creates a much stronger form of lock-in and potentially adds an extra bit of resistance for technical implementers in terms of adopting that technology, because they don't want to be locked in and have their data held hostage by a platform that may not exist in 10 or 15 years. And so having that migration capability built in is a strong concern there. And I'm wondering what your experience has been in that regard, in terms of the companies that you work with and their views on which solutions they want to have open source, or at least supporting open standards, versus being comfortable with buying a fully commercial solution?
Astasia Myers
0:51:03
Well, what we see around adoption of open source at different layers in the stack depends on three things: one, the type of business and the vertical that they're operating in; two, the maturity of the business; and three, the technical depth of their staff. In terms of verticals, we often see that laggard industries, like manufacturing and energy, are still okay with closed source offerings. Some of this is because they don't have the technical staff in place to be able to support open source and host it themselves. In terms of maturity of the business, often we see that early stage startups don't have the capital to spend on expensive commercial offerings, so they have to go to open source to build their product and support their customers. In terms of the technical depth of the team, we discussed this earlier: there's a concentration of talent, of people that simply know how to set up, manage, and remediate challenges with Kafka or Spark or CockroachDB. And sometimes teams can't hire these people; they don't have access to them. And so they need a commercial vendor to come in and help with this process. So it depends on a few different factors in terms of the business itself. In terms of layers of the stack, we often see that because the user is technical at the infrastructure layer, they can manage open source and get it up and running. Once you touch a business analyst, and sometimes even data scientists, they don't have the technical capacity to get these solutions up and running themselves. And that's why a packaged, hosted offering is the best fit for them.
Tobias Macey
0:52:45
All right, well, for anybody who wants to follow along with you or get in touch and keep up to date with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Astasia Myers
0:53:03
Yeah, we talked about it earlier, but I'm incredibly pumped about data quality. One, it is a top-three priority for essentially every data executive or even C-suite executive. Two, the solutions in this space are early but promising. And three, we think it's fundamental to any good data stack, because bad data quality results in bad decision making, which could be incredibly detrimental, and people presenting bad data lose face in organizations. So we think this is going to be big business across all enterprises. So data quality is top of mind. If you're working on a data quality solution, please reach out to me. I'd love to chat with you.
Tobias Macey
0:53:39
All right. Well, thank you very much for taking the time today to join me and share your expertise and experience of working with all these different companies across the different industries to tackle the data challenges that exist. It's definitely a very important problem domain, and one where it is important to have the necessary funding for these companies to be able to build the solutions they're trying to provide. So thank you for all of your effort on that front, and I hope you enjoy the rest of your day.
Astasia Myers
0:54:04
Thanks so much for having me.
Tobias Macey
0:54:11
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.