Data Management Trends From An Investor Perspective

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

What advice do you wish you had received early in your data engineering career?

If you handed a book to a new data engineer, what wisdom would you add to it?

I'm working with O'Reilly Media on a project to collect the 97 things that every data engineer should know, and I need your help.

Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you're ready to build your next pipeline or want to test out the project you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar to get you up and running in no time.

With simple pricing, fast networking, S3-compatible object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

You listen to this show because you love working with data and want to keep your skills up to date.

Machine learning is finding its way into every aspect of the data landscape.

And Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their machine learning engineering career track program.

In this online, project-based course, every student is paired with a machine learning expert who provides unlimited one-on-one mentorship support throughout the program via video conferences.

You'll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the life of a deep learning prototype.

Springboard offers a job guarantee, meaning that you don't have to pay for the program until you get a job in the space.

The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants.

It only takes 10 minutes, and there's no obligation.

Go to dataengineeringpodcast.com/springboard today and apply. Make sure to use the code AI springboard when you enroll.

Your host is Tobias Macey. And today, I'm interviewing Astasia Myers about the trends in the data industry that she sees as an investor at Redpoint Ventures.

Astasia, can you start by introducing yourself?

Hi. Yeah. I'm Astasia Myers. I'm part of Redpoint Ventures' early-stage team focusing on enterprise. Thanks so much for having me today, Tobias.

Yeah. Do you remember how you first got involved in the area of data management or working with data companies?

Yeah. It's actually pretty cool. I started my career in sell-side equity research covering publicly traded enterprise companies. So I actually covered Seagate, WD, NetApp, EMC, kind of all the big players. This was an exciting time for storage. It was the era when all-flash arrays were starting to come on the scene and software-defined storage was all the rage. You know, Pure Storage at that time was still private. And, you know, digging in as an equity researcher really got me familiar with the world of storage and data management.

Interestingly, I then transitioned to Cisco, where I was on the M&A and venture investing team supporting the core business units of servers and networking. And we were spending a lot of time analyzing the storage market and did quite a few investments in that space that I led. So I'm very proud of leading the Series C in Cohesity, which is now a unicorn in the backup and recovery space. And we also invested in Datos IO, which Rubrik acquired; Springpath, which Cisco bought; and Elastifile, which Google more recently acquired. And since my time at Cisco, I transitioned to Redpoint's early-stage team, and I continue to look at data and ML-focused startups, everything from new databases to ETL to ML tooling.

And I also share a lot of the research on the subject on my Medium blog and Twitter account for others to learn more about the category.

And before we get too much further into more of your background and the ways that you keep up to date with the industry, can you give a bit of an overview of what Redpoint Ventures works on and your role there?

Yeah. Of course. So Redpoint is a Silicon Valley-based VC firm that's been around for about 20 years. We currently have 2 funds, a venture fund and an early growth fund. I sit on the venture team. It's a $400 million vintage. We are quite enterprise leaning. So B2B investments represent about 80% of deployed capital, and we invest at seed, Series A, and Series B out of that fund.

And we've had a long history of investing in data companies. We've deployed over $215 million in capital over the last few years in those businesses and have been honored to partner with startups like Snowflake, Looker, CockroachDB, Dgraph, Pure Storage, Cyral, Dremio, and Springpath. So we've been doing it for quite a few years and continue to think there are opportunities in this space. It's definitely not slowing down.

My role on the team is enterprise-focused investment professional. I work with a handful of enterprise businesses today. 3 are in the data space: one we publicly announced, called Cyral, which is in data layer access control and visibility, and then 2 others in the data infrastructure world. So I love all things data and am looking to speak with startups in that space.

From your perspective as an investor and somebody who's working with these companies, what is it about the overall category of data-oriented businesses that you find appealing or attractive?

Yeah. There are a few different factors. You know, one thing that we like is these markets are absolutely enormous. So according to IDC, the big data and business analytics market is around $190 billion in annual spend, continuing to grow at a double-digit clip, and should be close to $275 billion in just 2 years. So absolutely enormous.

To put that in perspective, IDC for the same year says information security spend is only $120 billion. So big data is huge. You know, $65 billion larger in terms of scale. And it's not just the overall market. If you look at individual subcategories, you have BI and analytics that's about $20 billion, you have database management that's $50 billion, and data warehouses that are about $20 billion. They're just huge categories that we rarely see in enterprise.

The second thing is because they're big categories, there's a precedent of large outcomes in all the subcategories. If you look at BI, you have Tableau and Looker and Qlik and ThoughtSpot and Domo. Databases, of course, Oracle and SAP. We've seen Mongo and Cloudera. Data management, TIBCO and Informatica. Log management, Elastic and Splunk. So it's awesome. As an investor, you can point to businesses that have had really successful exits and become enduring companies, which is what we look for.

There are a few other things that we like in terms of the market dynamics.

These subcategories are large, but there's also a winner-take-a-lot or winner-take-most dynamic, which is excellent. So when I was covering publicly traded companies like EMC, they had the largest market share. It was around 35%, which is pretty impressive for any category. So really an oligopoly style of market here.

And then the tech, you know, it's really hard to build this type of technology. Differentiation matters and is felt by the user. And this technical differentiation can be a defensible moat for the business over time.

And because of that, finally, you know, these products are really sticky. It's core infrastructure. Once the technology is adopted, there's a moat, since it's full of friction to rip it out. And this means longer contracts that are potentially larger. All things we like to see.

And in terms of the information that you rely on for being able to keep up to date on what's happening in the data industry, and to understand what's relevant and what businesses are going to perform well given the overall economy and the overall environment that they're working within and building on, what are those types of information that you look at, and how do you gain that understanding?

Yeah. So as you can imagine, there's really no one source we go to. We take a mosaic approach to our research. So, you know, everything from podcasts, like the Data Engineering Podcast. You know, I've been a longtime listener and am so honored to be here today. But also newsletters like O'Reilly and Data Council, or even from the public cloud vendors that talk about their new product releases.

Social media is a great outlet, so Reddit and Twitter. Before, we'd be going to events like Strata and meetups around open source technologies.

And, you know, my personal favorite, because I came from academia and sell-side research, is speaking to operators and buyers. You know, these calls are often the ones that I have the most fun in during my day. They're just my favorite because, you know, it cuts out the noise and the hype. It goes straight to the people who are working with these technologies and thinking about their architectures.

And, you know, we actively engage operators and have a network. For anyone who's listening who operates in the data space, please email me, and we'll be adding you to our community events and engagement.

We think that the insights from the operators, the people on the ground, are fundamental in how we get information and make smart decisions.

And then within those different information sources, it can often be difficult to pull out the useful signal from the noise because of the variety of ways that it's being represented and potential biases in terms of how things are presented. So what is your personal heuristic for determining the relevance of any given piece of information and factoring that into your overall decision as to whether or not a given company or category is worthy of investment or further investigation?

Yeah. It's funny. We're on a show that talks about data, and data is information. And like everyone, we're inundated with information all the time. We're really looking for a needle in the haystack. Any tidbit of information that we read can actually change how we're thinking about the world or what we're doing in a single day. Right? If I come across something fascinating, I may spend a few hours trying to investigate further, so you're totally right. You know, I look for indicators and information that suggest 3 things: novelty, game changing, and this is the future.

You know, novelty is pretty clear. Is this new, original, or unusual? Have I heard about this before? We see so much information every day that just being novel can be a big deal. Game changing: is there sentiment around the information? Is this considered groundbreaking? Does this change someone's process or daily activities or their role? And then the future: we kind of think about, is this something that most people will adopt and implement? Can it become ubiquitous in a few years? And so if the information that I'm reading suggests those 3 things, it's pretty exciting for us and we dig in further.

And so as somebody who works closely with a variety of different companies across different verticals and areas of focus or problem domains, what are some of the common trends that you have identified in the overall data ecosystem that you're keeping an eye on and that you're spending your focus on in terms of identifying potential up-and-comers that are worthy of investment?

We are seeing 3 trends today.

1 is kind of flexibility around delivery models of the technology. 2 is that data security matters more now than ever before, and this could be data security solutions or data security in products. 3 is increased adoption of open source.

So in terms of the flexible delivery models, you know, previously everything was on premise. Now it can be in private cloud, public cloud, or a system that sits across environments. Additionally, a lot of the solutions can be a fully hosted SaaS, which we all know about, but we're starting to see the emergence of what we call Cloud Prem.

And the idea behind Cloud Prem, which is sometimes called VPC for virtual private cloud, is that it's a new architecture that splits the SaaS application into code and data. What's interesting about this is that the SaaS company writes, updates, and maintains the code, and the customer manages the data. So an example of this would be that the control plane is in the cloud, but the data operators are in the private or public cloud VPC. And this is one of the biggest changes we've seen recently.

Customers are adopting this approach for a few different reasons. The first is cost. So the cloud is more expensive for large enterprises. Many are considering moving back to their own infrastructure and using cloud only to manage burst. This is kind of what we saw originally, almost 10 years ago. And so this architecture can be more cost effective for these buyers. The second is control over data and access management. You know, this is often the crown jewels of the business. And the third is compliance. You know, we're seeing more regulation than ever before with CCPA and GDPR. A lot of businesses want to be better prepared and structure their data and how they engage with software applications more centrally for these reasons.

And what's interesting about this new Cloud Prem approach is that it really changes the customer's control of the data, and therefore, the power dynamics with the vendor. So when the customer controls the data, they can reshape it, move it, share it with a competing vendor, and the customer can create their own applications on top of the data. And the integration between applications no longer has to be at the application layer with Salesforce or Marketo APIs. It can now be done at the database. So it's really interesting, the unlocking of potential with this new delivery model.

People want to control the data themselves for compliance. SOC 2 is becoming a requirement for very early stage startups to sell into the enterprise. It is becoming the standard. And simply, no one wants to be in the news because of a data breach. You know, they've been very high profile, from Capital One to Marriott. No one wants that, so data security is key in these platforms.

And then the third is increased adoption of open source. You know, open source has existed in the data layers for a long time, Postgres, MySQL, but now we're seeing it touch nearly every type of database from graph to analytical. We see it moving up the stack to data catalogs, processing engines, ETL, and data quality. While you don't need to deliver data software as open source, you know, Snowflake, which we're investors in, is a clear example of this, there is movement in the space towards open source technology. People want to see the code, check out the tech for themselves, see if it works in their environment, and adopt it if it does without dealing with the sales team. And open source allows them to do that. It gives them the flexibility to be less reliant on the vendor, potentially decrease costs, and increase control.

Going back to your point about the private SaaS model of deploying the actual service into the customer's environment, there have been a few different people that I've spoken with as well that are taking advantage of that. And it's definitely an interesting delivery model, notable examples being Datacoral and Snowplow.

And there's also another company called ChaosSearch, all of which allow you to store the data in S3 or wherever your own applications are, and then they'll deploy the software to your environment and manage it. But you still, as you said, have control over the overall life cycle of the information, and you don't have to worry about handing it off to a third party and then requesting access to it to get it back, as with things like Google Analytics. And so for data in particular, where it does have that gravity and it does have the increased cost of moving it to and from different environments, then as you said, it's a cost savings as well as a control element.

Totally. It's really exciting to see this new architecture. We actually started seeing this emerge, let's say, about 2 years ago. Not in the data space, but businesses that exemplify this are Mattermost, which is an open source alternative to Slack, and Bitwarden, which is an open source password manager. They have these models. We have an entire thesis around investing in companies in this space. Both of the 2 that I just mentioned are in our portfolio.

And we think it's really exciting, you know, with increased regulation and higher costs for the public cloud. This gives buyers an opportunity to safely adopt technology that meets the regulatory requirements.

And so you also wrote an article recently that was highlighting the 4 main trends that you're keeping an eye on for 2020 in the data space. And those call out in particular the elements of data quality, data catalogs, observability of the influences for critical business indicators or KPIs, and streaming data. So taking those in turn, starting with the data quality aspect, what are some of the driving factors that influence that quality, and what elements of that problem space are being addressed by the companies that you're watching?

Yeah. So to make sure everyone's on the same page, data quality management ensures that data is fit for consumption and meets the needs of data consumers. To be of high quality, data must be consistent and unambiguous. You can measure data quality through a few different dimensions: accuracy, completeness, integrity, and validity, among others.

And there really isn't one single cause of bad data quality. Capturing the wrong data or excessively collecting it can lead to shortcuts for reporting, so there could be bad data quality.

Data quality issues are often a result of database mergers or systems and cloud integration processes in which data fields that should be compatible are not, due to schema or format inconsistencies or unclear field definitions. You know, at the basic level, manual steps of data entry and manipulation can cause problems. And finally, now there's a fragmentation of information systems leading to bad migrations or data duplication, so that data can become stale and out of date, which is also a form of bad data quality. And then fundamentally, you know, data can be corrupted, or there could be changes in source systems that can lead to bad results.

The companies that we're tracking can be categorized across 2 vectors.

1 is internally versus externally generated data. And 2 is in-motion or at-rest data quality evaluation.

So with regards to the first vector, you know, we can think about third party data that businesses in verticals like finance, real estate, and health care adopt. They're ingesting this third party data to inform their systems. And often, the data has not been cleaned and prepped by the vendor, and so they need to adopt technology to make sure the data fits well. And this is very different from internally generated data, like customer data that you would get in tech and e-commerce and CPG, where they need to look at the data their systems are organically generating.

And then the second vector is around evaluating data quality for data in motion versus at rest. So in motion means tying into Kafka streams or Pulsar to augment bad data in real time before it reaches the sink. Or for data at rest, there are solutions out there that can scan databases for null values, inconsistent formatting, or changes in the distribution of data. So you can make sure that the current data you're seeing at rest mirrors historical distributions to make sure there aren't any issues.
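
As a rough illustration of the at-rest checks described above (null values, inconsistent formatting, and changes in distribution), here is a minimal Python sketch. It is not taken from any vendor or project mentioned in the conversation. The function names, the assumed YYYY-MM-DD date format, and the choice to compare the current mean against the historical mean in units of historical standard deviation are all assumptions made for illustration.

```python
import re
import statistics
from typing import Optional, Pattern, Sequence

# Assumed expected format for a date column (hypothetical choice for this sketch).
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def null_rate(values: Sequence[Optional[object]]) -> float:
    """Fraction of entries in a column that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def format_violations(
    values: Sequence[Optional[str]], pattern: Pattern[str] = DATE_PATTERN
) -> list:
    """Non-null values that do not match the expected format."""
    return [v for v in values if v is not None and not pattern.match(v)]


def mean_shift(current: Sequence[float], historical: Sequence[float]) -> float:
    """Distance between the current and historical means,
    measured in historical standard deviations."""
    hist_mean = statistics.mean(historical)
    hist_std = statistics.stdev(historical)  # needs at least two historical points
    cur_mean = statistics.mean(current)
    if hist_std == 0:
        return 0.0 if cur_mean == hist_mean else float("inf")
    return abs(cur_mean - hist_mean) / hist_std
```

A scan might flag a column when `null_rate` rises above some tolerance, when `format_violations` is non-empty, or when `mean_shift` exceeds a threshold such as 3. Those thresholds are assumptions too, and a production system would likely use richer distribution comparisons.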

And so in terms of the overall data quality landscape, what are some of the unsolved areas that you see as being viable options for newcomers or new businesses to try and tackle, and that businesses would be able to gain value from and are actively looking for?

Yeah. What excites me about data quality is that it's foundational to a business's human and machine decision making. You know, dirty data can result in incorrect values in dashboards and executive briefings. It's kind of crazy. We've heard about bad data leading to product development decisions that can cost corporations millions of dollars in engineering effort. And then, you know, with machine-made decisions based on bad data, it could lead to bias or incorrect actions that could create a bad user experience.

We've come across a few startups and open source projects operating in this space: Soda, Toro, Monte Carlo, Great Expectations, dbt, and Nexus Data. It's kind of the Wild West in terms of data quality. It's an idea that's been top of mind for senior leaders for the past few years, but there really haven't been great tools out there to solve it. Some teams have built systems internally to identify data quality issues, but there hasn't been a platform that's emerged just yet. Most of the startups I mentioned have only been around for 2 to 3 years and are early in their journey. So we think there's a lot of opportunity in the space, as this is a top priority for senior leaders.

And another element that ties into the overall data quality question is ensuring that the processes that are being run on the data aren't eliding important information or introducing inaccuracies or old or biased data. And that is a big portion of what's covered in the overall concept of data catalogs or metadata management. And I'm wondering what you're seeing as being the main challenges that businesses face in establishing and maintaining those data catalogs and being able to have robust mechanisms for managing all that metadata.

Yeah. Data catalogs are super interesting because they capture rich information about data, including the application context, behavior, and change, and the lineage, as you noted. It's pretty neat technology because it supports self-serve data access, empowering individuals and teams so that they don't actually have to work with IT to receive data, and they can discover what's relevant themselves. And this actually helps improve the productivity of ML and data science teams.

The other thing we like is, as you noted, they can address PII. They can discover it. And so you can put controls on who can access PII data.

One of the challenges faced by businesses establishing data catalogs is the implementation. As you can imagine, there's fragmentation of data across different silos, databases, storage layers, sometimes in Excel. There are many resources you need to tie into, and this can make it hard to implement a solution.

And the second is really around user education and adoption. When we talk to buyers, people often say that theoretically, they understand the value of a data catalog, because the team no longer needs to work with IT, which can be a bottleneck for data access, and they can actually get fresher data by having a self-serve model. But often we hear that these individuals, you know, have to experience a data catalog themselves to fully appreciate the value. And I think that's why we're starting to see now the emergence of many different players. It's taken a few years to frame the value and gain visibility.

And now teams are starting to adopt it.

And there are, as you mentioned, a few different entrants to this market, some of which are fairly well established. The one that comes to mind most readily is Alation, but there are also a number of open source options. The ones that come to mind are Amundsen from Lyft and the DataHub project from LinkedIn. And I'm wondering, in terms of the available options that do exist, what you see as being the overall shortcomings in those products that might inhibit their adoption?

Yeah. You're right. It's interesting to see that Alation and Collibra have been around for a while. They're closed source, enterprise-oriented products, and there's been this new emergence of open source projects from Lyft, LinkedIn, and Netflix. And then other large businesses like Airbnb and Uber have built their own and publicly talked about it, but not open sourced it just yet.

The ways we look at the different types of technologies are, first, you know, closed source versus open source. You know, we are agnostic to that approach. But then also what data sources they ingest from, the stack they use, and the functionality that they support.

You know, there's a broad range of functionality. Everything from looking at sample rows, data profiling, freshness metrics, ownership, top users and queries, and lineage, in addition to the fundamentals of understanding schemas and metadata.

And then in terms of stacks, you know, you have Amundsen, which is Python and Node and uses databases like Neo4j and Elastic, while Metacat is Java and Elastic-only based.

And then finally, in terms of data sources, you know, this is how it could get implemented in an environment. For those that are using Airflow for DAG orchestration, Amundsen has a Python library to integrate at that point. Other solutions like LinkedIn's DataHub tie directly into Presto and MySQL and Oracle via API calls or Kafka events.

So it really depends on a few factors, as I noted: the breadth of the functionality that you're hoping to get from your data catalog, the stack that you are familiar with and comfortable with adopting, and finally, what your data sources are and your perspective on how best to integrate a data catalog. If you're not using Airflow, Amundsen may not be the best choice for you. A lot of businesses are using Airflow, so it's a great option.

It just really depends on your local environment.

And I think that's why we see so many different offerings in the space today.

Another piece that isn't specifically a data catalog, but that impressed me when I spoke with the folks behind it a while ago, is the Marquez project out of WeWork as a means ... you know, processes that produce the end results?

Yeah. That's a really interesting project that came out of WeWork. I think foundationally, for all of these data catalogs, it needs to be a seamless implementation into the environment, as I said, based on the data sources or orchestration layer that you're using. All of the data should be prepopulated. It should have freshness. It should be tied to a data steward, which could be auto-generated based on who is collecting the data, or by tying into your LDAP systems, so you know who should be accessing the data. One of the things that I think is most powerful is the search functionality in some of these platforms, where you type in a keyword and they auto-discover tables based on relevancy, people within your network, and the owner. So you can have the freshest data available.

And moving to the overall idea of observability of KPIs and the different factors that influence them and being able to dig into those different elements, what are the overall capabilities that are necessary in those types of systems? And what is lacking in any of the current approaches that businesses are using for being able to track those KPIs and be able to gain insight into what is causing different indicators to move in different directions?

In terms of what's necessary, it's obviously access to the data itself. So we've seen examples where they tie into Kafka; others tie into the data warehouse. The next layer is making sure you have a metric that is defined consistently for the product. And the third aspect is the rules engine or machine learning that is applied to identify what the appropriate bounds are for this type of data, alert on whether the data is outside those bounds, and also help with root cause analysis if there are challenges.
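
To make the "bounds and alert" layer described above a little more concrete, here is a small Python sketch of a rule that learns bounds for a metric from its history and flags values outside them. This is only an illustrative assumption about how such a rule could look, not how any of the vendors discussed actually work. The names `Bounds`, `learn_bounds`, and `check_kpi`, the `daily_orders` metric, and the mean plus or minus k standard deviations rule are all hypothetical.

```python
import statistics
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Bounds:
    lower: float
    upper: float


def learn_bounds(history: Sequence[float], k: float = 3.0) -> Bounds:
    """Derive acceptable bounds as mean +/- k population standard deviations.

    This simple statistical rule is an assumption for the sketch; real systems
    may use seasonality-aware or learned models instead.
    """
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    return Bounds(lower=mean - k * std, upper=mean + k * std)


def check_kpi(name: str, value: float, bounds: Bounds) -> Optional[str]:
    """Return an alert message if the value is outside the bounds, else None."""
    if bounds.lower <= value <= bounds.upper:
        return None
    return (
        f"{name}={value} is outside expected range "
        f"[{bounds.lower:.1f}, {bounds.upper:.1f}]"
    )


if __name__ == "__main__":
    # Hypothetical history of a daily KPI pulled from a warehouse query.
    history = [100, 102, 98, 101, 99]
    bounds = learn_bounds(history)
    for observed in (101, 60):
        alert = check_kpi("daily_orders", observed, bounds)
        print(alert or f"daily_orders={observed} looks healthy")
```

The sketch covers only the bounds and alerting steps. The consistent metric definition and the root cause analysis mentioned in the conversation would sit before and after it, respectively.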

The way you phrase the question, you make it sound like there's been a number of incumbents offering KPI observability for a while. You know, that's just simply not the case. Anodot is one of the most established vendors, and it was founded in 2014. Most of the companies that we're tracking were founded from 2018 to the present. Many of them are still building the solution, so it's hard to say that the current offerings are lacking.

What's been great is that one of the challenges historically for this category was the integration piece around where you tie into the data pipeline. Data typically was fragmented for customers. Now there are actually clear design patterns that have emerged in the data pipeline. 5 years ago, data warehouses weren't as mature. Now Snowflake, Redshift, and BigQuery are kind of standards. People usually have a data warehouse. People are moving to ELT versus ETL. And then you think about enriching the data in the warehouse. These solutions can tie themselves to the data warehouse versus multiple data sources. So it's easier to implement and extract value when data is aggregated in one location.

The second challenge that we've seen with these products is kind of the need to make it a self-serve model where customers can leverage preexisting metric definitions, if they existed in something like LookML, and apply them to KPI observability, or easily define their own metrics. So this can be a process. Sometimes, there's a cultural element of defining metrics in businesses that can make it a little harder.

And then the third challenge is really just the accuracy of identifying these anomalies and providing insights into the root cause. You know, the rules engines have to be tailored to not just the vertical that the company is operating in, but the company itself in order to extract value.

So it is a complicated solution to build and to make sure that customers are extracting value, but we really do like the fact that, 1, implementation has gotten easier. And 2, that when we talk to buyers, this is incredibly top of mind. The value prop is clear. Today, people want to look at information and have dashboards. There are often hundreds of dashboards in large enterprises. People aren't looking at them all the time, but they need to know if a KPI is out of whack, and these technologies are fundamental in allowing them to do so.

And then in order for any of these solutions to be useful and effective, you need to be able to collect and track the information that actually feeds into what is causing some of those different indicators to move. I'm wondering what the challenges are in identifying what those data sources are and then being able to potentially collect and associate them effectively.

And then in terms of once that data is collected, what are the challenges for businesses in this observability

space in terms of being able to display

and analyze the data so that it is easy to interpret for people who might not necessarily

have all of the training and being able to do their own analysis or do their own understanding of what all that data means and how it factors into the overall top level indicator of what they're trying to understand. Well, most data scientists typically have an

operational and analytical database. So I would argue that

the databases that you should be tying into are relatively clear in the customer environment. It's just how do you implement your solution. As I said, I think most

of the solutions that we've come across tie into the data warehouse,

like Snowflake because that is where information is being aggregated on the analytical side. In terms of how does a solution

appropriately

and clearly

show value,

and the challenges behind that. 1 is that the volume of data these databases are collecting is incredibly vast.

And so when you are trying to do this analysis and run your rules engine or machine learning over it, you actually need to load some of this metadata into

memory so you can run your analysis. So the movement of the metadata into

their architectures has to be fast, and they have to be able to fit the volume of data

into memory. So you have to be thoughtful about it. And then the second layer is the visualization

of this. People are operating in their BI solution

and often are looking at it a few times a day, if not all day long. And with these KPI observability

solutions,

ideally,

you'd be tying into BI so that you're alerting

in the current visualization

layer of the product. So it is imaginable that the BI solutions today will be offering this eventually. But if you're a third party vendor, you need to make sure that your UI is beautiful and very clearly identifies what is healthy and what is not healthy, and then points to what could be the cause of the data change in the present and over time.

And then the last point that you called out as a trend that you're keeping an eye on for this year is streaming, which is obviously an area that's been growing rapidly over the past few years with a number of different open source and commercial options available, most notable being Kafka and then Pulsar as 1 of its close competitors.

And then in terms of the,

corporate space, there's Databricks, which is focused on their streaming capabilities.

There's Flink, which is being used for a lot of stream

processing, and Pravega from the folks at EMC.

Wondering what you see as being the major business opportunities

for being able to make the streaming capability

more accessible and easier to implement and more effective for businesses that are relying on it? Yeah. So there's a few different ways we're thinking about improved streaming technologies.

You're right. There's both the processing layer

like Flink, and then you have the more storage pub sub layer like Kafka. Speaking about the Kafka pub sub layer, which is the category that we're spending the most time on right now,

not to say we're not interested in processing, but we just haven't seen as many novel offerings in that space. There's definitely a call to action to anyone listening,

if you are compelled by that category and have the background to do so. But in terms of the streaming platform world, we think about improvements across 4 different buckets,

speed,

volume,

management, and cost.

Regarding speed, everything is moving to real time like dashboards and workflows and actions. If data can flow faster,

actions and decisions can be faster.

And when it comes to technologies,

we've actually seen

a few open source projects and other commercial offerings that can be 10x faster than Kafka in production, a slight overhead

to what the hardware can do itself. In terms of volume, more data is being created faster than ever before. We've known this for decades now. It's hard to keep up with the data volume,

and so new solutions need to be able to deal with high data volume and

more topics.

In terms of management, we've been told

ZooKeeper, which is core to Kafka, is very hard to manage.

You know, often people staff someone to manage the Kafka cluster. I appreciate that the team is replacing this component, but we believe the user experience from a management

perspective can be even better.

And we've heard that maintenance can be challenging because the number of topics can grow quickly, so teams are constantly

balancing and upgrading instances, which can be hard. And then finally on cost, you know, in terms of cost, you can think about it from 2 different lenses. The number of people you have to staff to keep the service up and running. I've heard of teams that have, you know, 3 plus people trying to manage their clusters,

which can be very expensive given

the rate

for a great engineer. And then the second is the services themselves.

You know, Pulsar is interesting. It has a 2 tier architecture where serving and storage can be scaled separately,

which can decrease costs. And this is also really important for use cases with potentially infinite data retention, like logging where events can live forever. If you can move

this to

lower cost

environments like S3 as compared to high performance disks, this can help with cost management as well. So it's really 4 things: speed, volume, management, and cost.

And on the cost aspect too,

Pulsar and soon, Kafka

have the option of the tiered storage capability where you can keep the most recent data on that fast disk for access to recent topics

and then have different data automatically

life cycle off into S3 while still being accessible using the same API if you need to be able to run processing against historical information?
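The tiered storage idea described here can be illustrated with a small toy model. This is only a sketch under stated assumptions: `TieredLog`, `hot_capacity`, and the two in-memory dictionaries are hypothetical stand-ins for fast local disk and object storage, and none of this reflects the actual Pulsar or Kafka APIs. The point it shows is that readers use one `read` call regardless of which tier currently holds the data.

```python
class TieredLog:
    """Toy append-only log that keeps recent messages in a 'hot' tier
    and moves older ones to a cheaper 'cold' tier automatically."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = {}   # offset -> message; stands in for fast local disk
        self.cold = {}  # offset -> message; stands in for object storage
        self.next_offset = 0

    def append(self, message):
        offset = self.next_offset
        self.hot[offset] = message
        self.next_offset += 1
        self._offload()
        return offset

    def _offload(self):
        # Move the oldest messages out once the hot tier is over capacity.
        while len(self.hot) > self.hot_capacity:
            oldest = min(self.hot)
            self.cold[oldest] = self.hot.pop(oldest)

    def read(self, offset):
        # Same call for recent and historical data; the tier is hidden.
        if offset in self.hot:
            return self.hot[offset]
        return self.cold[offset]


log = TieredLog(hot_capacity=2)
for message in ["a", "b", "c", "d"]:
    log.append(message)

print(sorted(log.hot), sorted(log.cold))
print(log.read(0), log.read(3))
```

In this toy the cost saving comes from `hot_capacity` staying small while `cold` grows without bound, which mirrors the infinite retention use case mentioned earlier.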

Yeah. Exactly. I think that was 1 of the biggest improvements

between

Pulsar and Kafka.

And it's great to see that Kafka will also be introducing this, but we've heard from buyers this is incredibly useful. It really cuts down the cost for them and supports this long term storage,

which is great in certain regulated industries.

And then in terms of the factors that are driving this overall growth and the need for access to streaming data and real time information,

what are some of those driving elements, whether from the business landscape or the technical landscape

that are pushing companies to try to adopt these capabilities?

Yeah. It's interesting. I think it's a little bit of the consumer world

flowing into the enterprise world. Consumers have short attention spans, want data immediately, and want insights and answers as soon as possible. And we're starting to see this in the enterprise as well. You know, dashboards

are moving to being real time. You know, we're refreshing

at 10 minutes or less. Answers need to be as quick as possible,

and,

back end processes are all automated. And so the fresher the data, the better. We believe this is a huge catalyst

because everything is moving to real time feedback. Streaming apps produce and rely on a constant flow of this data. You know? Common examples include predictive maintenance and fraud detection, recommendation

engines,

IoT.

So all of this is,

increasing in terms of the volume but also the frequency in which it's being collected.

Data scientists typically use streaming data rather than batch to provide rapid insights. Similarly, AI and machine learning models leverage streaming data to constantly train and infer. In short, like, these 3 things make

using

streaming data across the board

more popular. It's pretty incredible. If you look at the market, it's growing significantly

from around

698,000,000

in 2018

to close to 2,000,000,000 in the next few years, a 22%

CAGR over the period, which is really fast for most enterprise segments.
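As a quick sanity check on those figures, the compound annual growth rate arithmetic can be worked through in a few lines. This is a sketch, assuming the three quoted numbers (698,000,000 in 2018, roughly 2,000,000,000 as the target, and a 22% rate) are taken at face value; the function names are illustrative only.

```python
import math


def cagr(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years) - 1


def years_to_reach(start, end, rate):
    """Number of years of compounding at `rate` needed to grow `start` to `end`."""
    return math.log(end / start) / math.log(1 + rate)


start, target, rate = 698_000_000, 2_000_000_000, 0.22

years = years_to_reach(start, target, rate)
print(f"{years:.1f} years at {rate:.0%} growth")
print(f"implied rate over that span: {cagr(start, target, years):.1%}")
```

At 22% a year the market would take a little over five years to reach the quoted size, which is consistent with "the next few years" measured from 2018.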

I think this is actually an underestimate

of how big the market is. I think we'll slowly see the degradation of batch and most things will go to streaming unless it's exorbitantly expensive. But as we can see, a lot of these new projects are, open source and cost effective. Yeah. And the interesting thing about the stream batch dichotomy is that a lot of the major proponents of stream processing and streaming data

attest that batch is just a special case of streaming and that you can and should just handle everything in the streaming context?

Yeah. It's really interesting to see that.

I like how they it's very smart positioning,

I would say.

It's when you think about some of the batch processing engines,

and I'm thinking about Spark's,

you know, stream processing layer, it's actually

microbatches.
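The micro-batch idea can be sketched in plain Python. This is a minimal illustration, not Spark's API: `micro_batches` and `running_total` are hypothetical helpers. The first chops a stream into small batches (the batch engine's view of streaming), and the second processes events one at a time, so a bounded batch is just a stream that happens to end (the streaming proponents' view).

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an iterable of events into lists of at most `batch_size`."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_total(stream):
    """Process events one at a time, emitting the total seen so far."""
    total = 0
    for event in stream:
        total += event
        yield total


events = range(7)

# Streaming approximated by small batches.
print(list(micro_batches(events, 3)))

# A finite batch handled by the streaming function: the last emitted
# value equals the ordinary batch result.
print(list(running_total(events))[-1] == sum(events))
```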

So they take the reverse opinion of that. But yes, it is very smart positioning to say that batch is a subcomponent

of streaming. I personally think that,

there's value in both systems, but there's gonna be a significant migration to streaming over time. Once again, I think that's

the consumer

appetite

percolating

into

enterprise businesses and wanting data

and answers faster than ever before.

And for businesses who are trying to adopt streaming, what are some of the barriers to entry that you're seeing and some of

the missteps or mistakes that you see being made that could easily be addressed by having a vendor that works

to paper over those challenges?

I think the main

challenge that we see

is the fact that there is a deficit of great

data engineers and data platform teams, and often

these businesses,

can't access those wonderful

individuals. There's a concentration of talent at tech companies

and especially on both coasts as compared to

broadly distributed across

North America and Europe. And so sometimes they just don't have the data teams in place that they need to be able to adopt these topics. You know, a traditional DBA is quite different

than what a current,

data engineering professional is responsible for.

And so vendors that provide hosted services

or the,

customer support, who was a driver early on in a lot of these businesses, really helped get

customers up and running with these systems.

The second thing that is, I think, lighter,

because I think most people appreciate the value of streaming is identifying

a use case where they can see a clear

ROI and the cost is not exorbitantly

expensive. Right? And so

finding those use cases for all businesses can be challenging at times. I think there's enough proof points in Fintech around trading and fraud detection

and in traditional enterprise around product

user

analytics, but it is still early days. I think there'll be more use cases that are unlocked and we discover in the future. And then with your focus on these 4 major trends, how does that influence your overall investment of time and attention and where you make decisions as to where to actually put capital in play? Yeah. It's interesting. Investors sit on a spectrum of being opportunistic

and thematic.

As you can kind of tell,

I do a lot of first person research. I like to publish that out to the community to share those insights and what we're hearing. So I lean thematic

because I think it's helpful to deeply know the landscape and the technical differentiation

in order to make literally data driven decisions about where we invest our capital and who we partner with. That's also really helpful. Right? When we have seen

a lot of different vendors

and startups and have a deep understanding of the category,

when an entrepreneur or founder comes to us with a piece of technology, we can truly appreciate

the challenge in building that and the value

of what they have done.

So, you know, I'm super excited about data and ML focused startups overall.

We believe these categories are massive,

foundational,

and can result in large exits. Since I'm more thematic, the 4 data themes that I discussed, data quality, data catalogs, KPI observability,

and streaming are particular

areas we've been digging into further based on

my research speaking with operators

and kind of our hypothesis

of where the world is moving to.

So it is when I do research and I share out my themes,

those are particular areas that I'd love to talk to founders and love to find a partnership opportunity.

And then outside of those particular areas of focus, what are some of the other unaddressed markets or product categories that you see which would be lucrative for new businesses, particularly in the data space?

So it's interesting. We think about

kind of data infrastructure

in terms of a Maslow hierarchy of needs.

And so kind of

foundational and at the base of the pyramid, you have data warehouses,

and beyond that you have ETL and then you have BI.

And the newer technologies we discussed today

are more at the top of the pyramid. You know, more fast movers,

early adopters

are considering these technologies today, like data quality. But we believe the industry is moving in this direction, and these pieces will eventually become a crucial component of the future data stack. And that's why we're evaluating them. Beyond the topics we talked about, I think there continues to be an opportunity

to improve data access and usage for non technical users.

You know, I appreciate this show is,

really targeted to people that are technologists,

engineers,

data platform

leads, or PMs in this space, but we're excited about new technologies

that help the everyday

business user

access,

munge,

and leverage data. I think great examples of that are Airtable,

which completely changed how people think about spreadsheets,

Alteryx for

data munging and cleaning,

and then finally, you know, solutions

that allow people to adopt ML in their workflows,

like forecasting, inventory management,

financial projections.

This is really neat. Previously, all of that was confined to, you know, data scientists

and machine learning engineers that have such technical depth. And with the commoditization

of

machine learning algorithms and new platforms that are making it easily accessible

to the everyday business user. It's incredible to see how,

in the future, business oriented employees will be able to access, clean, and create models themselves,

creating huge business impact. So really excited about

all data tools that can facilitate

the work of non technical users.

And so in most areas of technology

and in data in particular these days, there's a strong mix of open source and commercial solutions that are available for solving any given problem with varying levels of maturity and polish between them, where a lot of the

commercial options

might just be an optimization of an open source platform that adds in some ease of use capabilities

or additional security measures. And I'm wondering what your views are on the overall balance of this relationship in the data ecosystem.

Well, Tobias, I'm so happy you asked this question because it's 1 of the questions that we get asked the most as investors

with founders

operating in the data ML space. They always think, do I have to be open source? Should it be closed source? How do I demonstrate the value in a commercial offering

when I have an open source project?

Overall, we wholeheartedly believe

that a great solution can be open or closed source. Right? You have great open source projects like Kafka and Spark and Elastic and CockroachDB,

which are wonderful examples

of a venture backed startup that's supporting an open source project and then adding commercial value on top through security,

compliance,

integrations

that will help with customer

But there are also examples of closed source offerings that are absolutely killing it, like Snowflake and Dremio. So there really is no 1 right answer. What we see is that the mix of open source to closed source also depends on where in the stack you are operating.

Core infrastructure

like databases, we are seeing

even more of a movement to open source.

Higher in the stack like BI, traditionally, it's been closed source, but people are starting to adopt open source options like Superset

if they're a more technical user. What we find is the closer the solution is to the business person, the less likely the solution is going to be open source because these people are unlikely to know how to get it up and running. So

a fully packaged solution is a better fit for them. What we often find is that open source is used more for customer acquisition to generate pipeline.

So technologists can pick up the solution, check it out themselves, see if it works as I said, and then the company that supports the project can call on the user and try to convert them into a paying customer

for support, an enterprise instance, or a fully hosted offering. This same dynamic

on the go to market side that open source creates can be created by closed source technology. This can be through

trials with self serve models

or sandbox environments.

So it really can be approached

with both form factors.

I would think about 2 things. 1, the technical capability of your buyer, and

2, how low in the stack you are operating

because there's more of a precedent lower in the stack for open source than higher.

And as you mentioned, the closer solutions get to the business user, the more likely they are to be commercial. But another element of that is that the lower down elements in the stack that are being increasingly open source in terms of databases or streaming platforms

are also the elements that are closest to the core data

and how it's being stored and represented, which is where a lot of the potential lock in occurs. So business intelligence platforms, for instance, are fairly easy to swap out because you just need to connect it to the data source. But if your data is owned by a proprietary solution and it's stored in a proprietary format, it creates a much stronger form of lock in and potentially

adds an extra bit of resistance

to technical implementers in terms of adopting that technology because they don't want to be locked in and have their data held hostage by a platform that may not exist in 10, 15 years. And so having that migration capability

built in is a strong concern there. And I'm wondering what your experience has been in that regard in terms of the companies that you work with and their views on which solutions they want to have open source or at least supporting open standards versus being comfortable with buying a fully commercial solution? What we see around adoption of open source at different layers in the stack,

it depends on 3 things. 1, the type of business, the vertical that they're operating in.

2,

the maturity of the business.

And 3,

the technical depth of their staff.

In terms of verticals, we often see that

laggard industries

like manufacturing,

energy,

are still okay with closed source offerings.

Some of this is because they don't have the technical staff in place to be able to support open source and host it themselves.

In terms of maturity of the business,

often we see that early stage startups don't have the capital

to spend

on

expensive

commercial offerings. So they have to go to open source to build their product and support their customers.

In terms of technical depth of the team, we discussed this earlier. There's a concentration of talent of people that simply know how to

set up, manage,

and remediate challenges of Kafka or Spark or CockroachDB.

And sometimes teams can't hire these people. They don't have access to them, and so they need a commercial

vendor to come in and help with this process. So it depends on a few different reasons in terms of the business

itself,

in terms of layers of the stack. We often see that because the user is technical at the infrastructure layer, they can manage open source and get it up and running. Once you touch

a business analyst and sometimes even data scientists, they don't have the technical capacity to get these solutions up and running themselves. And that's why a packaged

hosted offering is the best fit for them.

Alright. Well, for anybody who wants to follow along with you or get in touch and keep up to date with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

Yeah. We talked about it earlier, but I'm incredibly pumped about data quality.

It is a top 3 priority for essentially every data executive or even c suite executive.

The solutions in the space are early but promising.

And 3, we think it's, fundamental to any good data stack because bad data quality

results in bad decision making which could be incredibly detrimental

and people presenting bad data lose face in organizations. So we think this is going to be ubiquitous

across all enterprises. So data quality is top of mind. If you're working on a data quality

solution,

please reach out to me. I'd love to chat with you. Alright. Well, thank you very much for taking the time today to join me and share your expertise and experience

of working with all these different companies across the different industries for being able to tackle the data challenges that exist. It's definitely a very

important problem domain and 1 that is important to have necessary funding for these companies to be able to build the solutions that they're trying to provide. So thank you for all of your effort on that front, and I hope you enjoy the rest of your day. Yeah. Thanks so much for having me.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com

to learn about the Python language, its community, and the innovative ways it is being used.

And visit the site at dataengineeringpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com

with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
