Mapping The Data Infrastructure Landscape As A Venture Capitalist

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data.

RudderStack Transformations lets you customize your event data in real time with your own JavaScript or Python code.

Join the RudderStack transformation challenge today for a chance to win a $1,000 cash prize just by submitting a transformation to the open source RudderStack transformation library.

Visit dataengineeringpodcast.com/rudderstack today to learn more.

Your host is Tobias Macey, and today I'm interviewing Matt Turck about his annual report on the machine learning, AI, and data landscape and the insights around data infrastructure that he has gained in the process. So, Matt, can you start by introducing yourself?

Yeah, absolutely. Thanks for having me. Long-time listener, first-time caller, big fan of the show. I'm a venture capital investor. I'm a partner at FirstMark, which is an early-stage venture capital firm based in New York. I've been a big fan of, and a very active investor in, the general space of data infrastructure, machine learning, and AI, from the infra layer all the way to applications.

And do you remember how you first got involved in the overall area of data and data management?

Yeah. I started my career in technology as an entrepreneur. I was the cofounder of a company called TripleHop Technologies that did enterprise search and knowledge management. I like to joke that today it would be a really hot AI company because, ultimately, it was all about unstructured data management, and we used a lot of Bayesian techniques at the time. But I remember that we had to work really hard to convince my now peers in venture capital that AI was really a thing. In particular, we had a CTO, Joaquin Delgado, who had a PhD in AI. After a few pitches, we learned that we needed to downplay the fact that his PhD was in AI and instead position him as having a PhD in computer science, just because VCs were looking at us like, "Kid, have you not gotten the memo? AI is dead." Obviously, that was pre-deep learning and in the middle of an AI winter. So, different times.

And so now that brings us to today, where you have been compiling and publishing this MAD landscape report for the past few years. I'm wondering if you can start by giving an overview of what that even is, and some of the story behind how it got started and why you thought that it was a useful exercise.

I've been doing this for 11 years now. And, look, the general idea is, quite frankly, that it was originally for my own benefit. It's an exercise that I feel I need to do to be able to keep track of what's going on in the space. So the key goal is really doing the work. I personally think that the era of the generalist VC is long gone, to the extent that it ever existed. To be a good investor and a good board member, you need to have deep expertise in the space, or at least spend a considerable amount of time in the space. So that's one of the tools I use to do that. It's really the forcing function of publishing that landscape every year that makes me keep on top of the companies and the trends, and that's really all there is to it. Then there are two different approaches. I could decide to keep that as a sort of proprietary knowledge, or I could decide to sort of open source it. And the approach has always been to do the latter, because I think it's just more fun, and I just get more out of it by doing that. It has started a never-ending number of conversations with the community, and I find it both useful and immensely enjoyable.

But if you publish all of your hard work as a VC, then isn't all of your competition just going to snipe all of your potential investments?

Look, it's even worse than that. I don't know if that's a podcast conversation, but it's happened more than once over the years. There's the landscape, but I also write this long, kind of state-of-the-union blog post. And more than once, I've had some VC colleagues in meetings with me, one on one, say something in a certain way, with the sentence, you know, organized in a certain way. And I was like, I literally wrote that, and you're playing it back to me during this meeting without realizing that you are, which is absolutely priceless.

Look, the way I think about it is, again, like open source. Yes, there are downsides to open sourcing the analysis, but eventually the information is all out there and people can compile it. And I don't write everything that's going through my mind as I do this. I write mostly what's out there in terms of facts and analysis, and it's more of a compilation exercise and an analysis of market trends, as opposed to where specifically I think the world is going or where the most interesting opportunities are. So I try to strike a balance. But ultimately, I try to make this a free resource to the community, for the bulk of it.

And in terms of that decision of what is actually useful to include in the landscape, what are the things that are in my best interest to include in the landscape versus leave out? I'm wondering what your initial approach was for figuring out what are the actual pieces of data that are relevant to the broader community, what are the questions that people are going to be asking about it, and how can I help to answer those questions based on the information that I include, and how that process has evolved from your very first edition to where you are today, where you've been doing it for, I think you said, 11 years?

Yeah, look, I try to be broadly inclusive.

I think the difference this year is that we've had a much more opinionated approach to selecting which companies get into the landscape or not. In prior years, we tended to give particular priority to companies that were a bit later stage. Either they had higher revenue levels, or they had raised more money, or that kind of thing. This year, we decided to also include a bunch of companies that we found very interesting. That's particularly because we wanted to give a good amount of real estate on the landscape and in the writing to generative AI companies. And, literally, most of those companies did not exist six months ago. So if we said we're only going to include Series B or later companies, then all the generative AI companies would not be in there. So that's one of the ways we're opinionated. But look, at some point, we have 1,416 companies on the landscape. We could easily fit another 1,000 or 2,000 on it. So we have to make decisions. And, by the way, we miss some companies. We get it wrong all the time, but the community is sure to let us know, and that's one of the ways it's been really interesting. We have learned a lot from it.

And also, 11 years ago, the overall space was much smaller, so you could fit it all in one graphic, which I'm sure is why you had that initial ambition of, hey, let's put it all in one place.

If only I had known. Yeah, I imagine you're regretting that now. I'm gonna need to get one of those giant rolls of butcher paper. Yes, exactly. Making a mental note next time to pick a space that doesn't expand.

Maybe COBOL programming. There you go.

And to your point of, you know, we publish it, we get it wrong, people give us feedback. I'm wondering what you see as the potential for being able to make this a properly open source activity where here's the code that we use to generate the visual, here's the set of fields that you need to fill in, and let's just update it piece by piece by pull request so that people can add and remove things as they see fit. What are some of the potential risks that you see in that approach?

Yeah, that's an interesting thought, and that has crossed our minds a couple of times over the years. I think one of the key editorial decisions that wouldn't work there, or would be much harder, is that a lot of companies, and quite often rightly so, think that they should be in many different categories at the same time. In some cases, it's probably true, if you look at the cloud hyperscalers, or if you look at Databricks. Databricks at this stage could probably have a logo in most of the boxes, especially on the left side of the landscape. However, it gets a little more complicated when you have a seed-stage company that says, hey, we should really be in 27 subcategories.

Absolutely. And whenever I look at the MAD landscape, it also puts me in mind of the CNCF landscape that they've built, and I'm wondering, did they copy you? Did you copy them? Is this just a general paradigm of how to organize information? I'm wondering what you see as the interplay between these types of visualizations of an overall ecosystem.

Yeah. It all started, I think, with the LUMAscape back in the day, which was a martech landscape started by, I think, LUMA, the investment bank. I may forget the exact details, but that was the granddaddy, or the OG, of all those market maps, as far as I know. Where we did get inspired by the CNCF landscape this year, in part, is that for the first time, we've launched an interactive version of the landscape. If you go to mad.firstmarkcap.com, you'll see the interactive version. I like to joke that it's a big innovation because, apparently, this World Wide Web thing is major, and you can actually get on this thing they call the web, a website where you can click on the logos and have an interactive experience. So we're very proud to have done this this year. But it's actually been very helpful. You can also bring up a card view, and as you click, you get data that was provided by our friends at CB Insights. So it's good. Now the MAD landscape is really a combination of the PDF landscape, the interactive version, and the write-up, the kind of state-of-the-union write-up that we produce around it.

And as far as the overall landscape and the ecosystem that we're operating in and trying to catalog as these snapshots in time, as you do this every year or roughly every year, I'm wondering, over that period, how you have seen the influence of the types of projects and companies that are founded as we go from this early stage of big data, where everything is Hadoop, to where we are now, where everything is the modern data stack, and the different splinters that break off as we go along that journey.

Yeah, absolutely. So indeed, the big initial burst of energy into that ecosystem was really Hadoop and all the related technologies. And, actually, if you fast forward to this year, 2023, for the first time we actually killed the Hadoop box. We had kept it on there because the Hadoop footprint is actually much wider and stronger than one would suspect. So we kept it on there up until now, but now we've merged the vendors and the companies into the data lakes and data lakehouses box. That was a little bit of the end of an era.

Separately, indeed, the modern data stack was the big next phase, with, as we all know, the creation of this whole ecosystem. So not just the cloud data warehouses, but all the tools before and after. In the last landscape, we had the emergence of brand new categories like reverse ETL on the left side of the data warehouse, and on the right side of the data warehouse, we had metric stores. So there are all those new boxes that appeared. So, yeah, the landscape has very much followed this.

The poor parent of the landscape for many years was the right side, which is applications. The way to think through the landscape is that, roughly, the left is data infrastructure. That's where stuff is stored and computed and processed. Then the middle is data analytics, so data leaves and gets analyzed. And then the right side is where data gets used, so those are data applications. For a very long time, the action was on the left side. And it really feels that this year, in part with generative AI, a lot of the action has truly and finally started moving in earnest to the application side of the house, where it's become more apparent than ever how you use all those technologies. So think again of data moving from the warehouse, or wherever it's stored, to BI on one side of the fork, and on the other side of the fork, machine learning and AI, and then a lot of ML- and AI-related applications.

Another interesting challenge of this ecosystem, and this problem that you have created for yourself, is that question of categorization, where, as you mentioned, Databricks could have a logo in every single box, basically. And also there's the question of how do you define what the categories are and what their boundaries are, because a lot of these different tools and products maybe started with a very narrow vision but have grown into encompassing some of the adjacent concerns. New capabilities and new categories have arisen. Maybe sometimes they're justified. Maybe sometimes they're a flash in the pan. And I'm wondering what you see as a useful exercise to figure out what are the useful categories to break these things into, particularly as it has gone from a very linear flow of data that starts in one place and ends up in the business intelligence dashboard to where we are now, where it's become cyclical through things like reverse ETL and AI being used in data infrastructure, and it's a tangled web more than it is a linear flow, if it ever was one.

Yeah, absolutely. That's where you have to be somewhat opinionated and make calls. And look, we certainly don't pretend to be definitive in this landscape. Again, it's an exercise rather than a final statement on how things are, and in many ways a way of starting and generating conversations. But, yeah, you have to be willing to explore, and you have to be agile on your feet. Sometimes you add a category and sometimes you kill a category. For example, we had the metric store that was added in the last landscape, and we decided to remove it in this landscape. What's interesting about this example is that the need for a metric store is very clear, and that's certainly very important. However, as a separate box, it felt less justified because dbt launched their own metric store, and then they acquired Transform. Then you had another company in the space called Supergrain that pivoted. And then you had another company in the space, in which I'm a proud investor, called Trace, that added a whole application layer on top of the metric store. And, yes, there are other companies that position as a metric store, but you end up with one or two. And by the way, those companies do other things as well. So it felt like killing the category as a separate box, again fully acknowledging that it's an important functionality, was the right thing to do.

Another interesting element of this problem is that there are some problems that are solved by companies that have their own set of founders and their opinions. There are some categories that are largely dominated by open source projects that don't necessarily have a strong corporate owner, though many of them do. And I'm wondering what you have seen as the overall impact of investment and venture capital on the evolution of the data landscape and which problems are focused on and paid attention to, and which of the problems maybe deserve attention but aren't being funded for whatever reason, whether it's societal, economic, pick your reason, particularly in the situation of the past few years where capital was very cheap and plentiful and easy to come by, and some of the ways that that has impacted the way that this landscape has grown.

Yeah. Look, I think we are coming out of a pretty frenetic period of time in the data infrastructure landscape, which was in particular accelerated by the Snowflake IPO, which to this day is the most successful and biggest software IPO ever.

I think that had pros and cons. The pro was that, for a long time, we were in that let-a-thousand-flowers-bloom kind of mode, where if you were a technical founder with, you know, enough experience in cloud and a vision, you could get money, which is great. So people were able to just start companies left and right and experiment. That was very exciting in many ways. There were plenty of interesting companies that were started all of a sudden.

The obvious drawback of this is that we ended up with categories that were overcrowded overnight, and overcrowded by companies all at the same time. Everyone was in the mode of: okay, this is a real problem. This category needs to exist. There are maybe another one or two companies in the space, but they're all early. Therefore, if we start or fund another company, we'll have just as good a chance as anybody else.

As a result, now the music has stopped in terms of financing, and everybody's looking around and trying to figure out which of those companies are going to be able to survive. We're certainly now in a phase where the market is not going to be able to sustain all those companies, which again tend to be what I would call single-feature companies, through no fault of their own. It's just the nature of a young startup that you start with something. You start with a product that typically looks like a feature, and that's how you should be doing it. But you need more time on earth to be able to turn that feature into a product, and a product that's truly enterprise ready and adopted and deployed in production by many customers. A lot of those companies were started in 2020 or 2021, or 2019 or 2018. But at the end of the day, they are two-, three-, four-year-old companies that are below, say, 5 million in ARR, in a bunch of different categories. And that's very much an uncomfortable place to be for those companies.

So there's an oversupply of companies and an oversupply of products, and that is in a context where the buyers of technology, the customers, are under very clear pressure from their CFOs and CEOs to cut costs or to keep costs under control. It's a situation where the VCs are under their own pressure to focus on their portfolio and be very discerning in the bets that they make going forward. And, finally, a situation where the potential acquirers of those companies are going to very soon find themselves in a situation where they could pick any of 10 companies as the potential target that they will eventually acquire, which obviously will have a deflationary impact on the price of any of those potential acquisitions. So it's, all of a sudden, a little bit of a tough situation.

Look, some companies will navigate this very well through a combination of skill, having money in the bank, and a bit of luck, and will emerge on the other side of this stronger, leaner, fitter. I think that will work out great for the entrepreneurs and the VCs. But I think for everybody else, it's going to be more challenging.

Another interesting aspect of where we find ourselves today is the combination of capital not being distributed as freely as it has been the past few years, and it seems like a lot of the focus and hype has reoriented from the data infrastructure level to these AI- and ML-focused product categories, particularly with things like generative AI and the advent of transformer models and the general capabilities that are being built up there. I'm wondering what you see as the interplay of the state of data infrastructure as a broad category of problems and the level of maturity that we've reached there, and the broad attention that's being paid both in terms of enterprise, venture capital, and technological investment, as well as the flashy headlines that are coming out with things like ChatGPT, etcetera, and how that's going to influence where you see new companies being founded.

Yeah. So for sure, the VC train has moved away from data infrastructure into generative AI. From a data infrastructure founder perspective, it's both a curse and a blessing, I would argue. It's certainly a curse because it's going to be tougher to raise that next round, especially if you're in a situation where you raised the prior round at a valuation that was way ahead of the reality of the business. At the same time, the blessing part, I think, is that you can now focus truly on the business, the product, and the customers with the relative comfort of knowing that you're not going to wake up tomorrow morning to three announcements from new companies that were just started and founded in your space, or a competitor that decided overnight to get into your space after they raised yet another big round.

So that is helpful, I think, and the market is going to thin out. If you are one of those companies that sticks around, that is in this kind of survival mode or fitness mode, where you are truly efficient and really building product and selling to customers and making customers happy and all the things, then in many ways, if you're not completely over your skis in terms of valuation, you're in a challenging but pretty decent position. And again, I think some companies will emerge from all of this as the leaders in their category, and will find that what's currently happening is the best thing that could possibly happen to them, as opposed to just getting more money for free from VCs and getting distracted and having more competitors and more noise.

So that's data infrastructure. If we look at the world of generative AI, and I don't know if that's a separate podcast or not, my concern is that the situation in data infrastructure that we are now in the process of untangling is forming again in generative AI. And, look, it's ultimately nobody's fault. It's easy to blame the VCs or the press or Twitter or what have you. It's the logic of the capitalistic system that VCs and founders, everybody in technology, are looking for those disruptive moments when something very meaningful is happening. And, clearly, that's the case in generative AI. So, clearly, that's going to attract a lot of attention. It happens to be in the particular context of an otherwise very dire market. So generative AI is not just a major inflection point and possibly the next big platform of the future, but it's also the one bright spot in this challenging economy and certainly challenging tech world. So, yes, people are rushing into it.

The net effect of this is that all the stuff that happened in data infrastructure is happening now. You have a bunch of companies that probably should not be started. You have a bunch of very technically strong machine learning engineers who are starting new ventures and who, in an ideal world, would be excellent founding team members. But because they can, and because they are courted by VCs who tell them, hey, if you want to start something, here's 5 million or 10 million or whatever the amount is, those people are starting companies. And that's going to take a couple of years to play through the ecosystem. Those companies are going to start. They are going to try and build a product that may or may not get to product-market fit, especially given all the noise. At some point, they will come to the conclusion that they will probably need to be part of other organizations. But when I say two years, it's probably more like three, four, or five. So it is what it is.

It's the logic of the system to go through those booms and busts. I think, directionally, it's produced great companies. But for somebody like me, I've been at FirstMark for 10 years now, and just during those 10 years, that's my third hype cycle in AI. The first one was up to 2012. The second one was sometime around 2014, 2015. It feels like another one now. And those things typically don't end as well as one would think. So there's a little bit of that feeling of, okay, here we go again. But what can you do? It is exciting, and I'm excited for the founders who start businesses in the field. I'm making investments in the field, and you have to play the moment. One just needs to be ready for the inevitable backlash that will happen at one point or another. So more than ever, it's: build AI businesses for the sake of building a business and making customers happy, and not because you can do cool things with generative AI.

And the other interesting thread between these two moments is that you can't build AI if you don't have solid data infrastructure and data engineering. I'm wondering, from your conversations with people on both sides of that, how much you see people understanding that dependency chain, and if you have seen any concerning elements of people jumping straight into AI saying, well, all that stuff's done. I don't have to worry about that. I just throw data at it, and it's good.

Yes, I've certainly heard that, and I've heard the flip side of, you know, what's really hard is the data engineering stuff. The AI stuff is done. That's easy. That's just what you add on top. It's a really good question, because that's a question we've been asking ourselves every single year when we do this MAD landscape: whether we should keep everything on one chart, especially as the real estate has become more and more expensive, because there are only so many little logos we can fit on one page. The question has been whether we should do two landscapes, one for machine learning and AI and one for data infrastructure. We've decided to keep everything on one precisely because of that symbiotic relationship.

That's one of the reasons why I am very excited about AI. There's certainly the generative AI part, for sure. But almost separately from generative AI, I think we also are at a phase of the cycle where a lot of companies are much closer to having their data house in order than ever. And, indeed, having your data house in order is the absolute requirement before you can do anything meaningful with machine learning and AI.

So I'm excited to be at that phase of the cycle. Of course, that's the result of the modern data stack, which we talked about. The rise of the data warehouses and the data lakehouses has, for the first time, brought us to a level of maturity where enterprise AI becomes truly a possibility at scale and in a ubiquitous manner. Look, not to be the VC that talks about his companies all the time, but I have been a very proud investor and board member at Dataiku, which has now emerged as the leading enterprise AI data platform, and the acceleration over the last few years of that company, as a bellwether of the broader industry, has been really interesting to watch. They have a close relationship with both Snowflake and now Databricks, and you can almost see it mechanically. The companies that have deployed Snowflake at scale, or Databricks at scale, or any comparable situation, have been turning their attention to enterprise AI, a part of which is generative AI, but most of which is actually not generative AI. Most of it is fraud detection and churn prediction and supply chain management, inventory management, all those use cases that actually don't require GPT or an LLM or what have you. So we're really at that phase now where enterprise AI is going from being the poor parent of BI to being a first-class citizen in the enterprise.

In this phase of adjustment or contraction, or however you want to phrase it, I'm wondering if there are any general product categories that you see as being particularly ripe for consolidation or being swallowed up by adjacent concerns or adjacent problem spaces. And are there any niche product areas that maybe look like they're an opportunity to be subsumed by other products or other companies, but that you see as actually likely to remain competitive as we move into this uncertain future?

Yeah, there's plenty. And, a little bit like the metrics store I was describing earlier, I don't mean to say that any of those categories are not very important categories, or that important companies will not emerge from those categories. Not at all. But having said that, there are certain categories which are clearly ripe for evolution, consolidation, all the things.

One of them, I think, is the world of data observability. To some extent, that's already happened. As you remember, we used to have different categories or subcategories. There was a whole data lineage world. Then there was data quality, and within data quality, there's declarative data quality, and there's machine-learning-driven data quality. And then there's data observability, which covers some part of this or all of this at the same time. We've already seen data lineage sort of disappear as a subcategory. To me, all of this is more or less the same thing. And look, if you speak to the companies, as you do all the time, everybody is going in the same direction, which is to be the Datadog of data, which again is 100% a beautiful prize if you get it, and a very ripe opportunity to build important companies.

But this all these companies are going to need to work together. And arguably,

you could you could say, that,

orchestration

is

a part of that whole discussion

as well. Because

if you're a customer, ultimately, what you want is not a data lineage vendor, data quality vendor, data observability vendor, and an orchestration,

vendor or open source project that you use.

Ultimately, what you want is for your data to be of a high quality.

If that's an issue, you wanna hear about it quickly.

You wanna know

where

the data issue comes from,

and then you wanna be able to fix it. And,

all the things should kind of work together. And what I described is a combination again of, like, data quality, data lineage,

and, you know, orchestration.

So

I think this whole world needs to evolve

towards,

simplification.

As painful

as consolidation

might be from

a vendor perspective, so founders,

VCs, and startups.

From a customer perspective, consolidation

is

going to be

generally very helpful and, I think, very welcome because

that's less technologies that you need to

become fluent with as a user. That's less contracts that you need to manage.

That's definitely cheaper because you're not in a situation where every vendor needs to increase their revenue and increase their margin in order to get to the next round of financing.

So I think the customers are gonna end up, you know, being the beneficiaries

of a lot of this. So that's 1 example, data observability, data quality. Again, not to pick on them.

Another category for me clearly is, MLOps.

So I don't know if that falls into data infrastructure or AI machine learning.

That's another category where you've had

dozens of companies founded over the last few years, and some of them are closed source, some of them are open source, and

some of them started doing data

model

management,

data model

governance,

or,

you know, AI fairness or AI transparency,

and everybody's coming from a different angle. But fast forward to

today, especially in a context where VC financing is less abundant,

everyone is realizing that, okay, well,

you know, I have this product now, and I've been able to get to 2, 3, 4, 5,

10,000,000 in revenue. But,

to grow into my valuation and win those categories, I'm going to need to be not just the best company

in AI

fairness,

but also I'm going to need to expand and effectively become an MLOps platform,

which, again, is a feature, not a bug. That's how you wanna grow as a startup. I just think this is going to get accelerated by the overall pressure

on the category.

So, this is already starting to happen, and you see companies,

starting to do lots of different things and evolving towards a platform.

But the market is not going to be able to sustain having 30 different MLOps platforms. So something is going to happen to that category,

for sure.

But again, as always, some companies will do great, just not everyone.

Another interesting element of our current point in time and space is

the fact that because of things like the Snowflake IPO, because of the general

kind of evolution from Hadoop of this is operationally very heavyweight and difficult to manage, but we want to be able to get this power of being able to compute across massive data to

we can do this in a cloud data warehouse, but it doesn't solve everything. You can't use SQL for all of your business logic. You know, hence, we have the modern data stack as kind of the de facto architectural paradigm.

I'm wondering if you foresee

any impact on kind of what that de facto architecture looks like as we move into this phase of contraction, as we have explored a lot of the potential space and

tried figuring out kind of what is the proper balance of cost versus compute and scalability versus storage space, etcetera.

Yeah. In the last 18 months or so, I've seen the core principle of big data

being challenged for the first time. And by core principle of big data, I mean this general idea that

you should,

collect and store

all

of your data

and, quite frankly, occasionally, just figure out what to do with it later, which was, you know, the whole Hadoop idea. Like, stop throwing out your data,

just dump it into this big bucket, and,

then magic will ensue.

So as we all know, it took a little bit of time for magic to actually ensue after that.

But, you know, that big data logic

very much translated

or carried over to the world of data warehouses.

If you think of

Snowflake,

ultimately, the beauty of it is that it's this infinitely

elastic

warehouse in the cloud, and the data lakes or data lakehouses, like, similar

idea.

Of course,

storage is 1 thing.

Compute is another thing.

As it turns out,

if you dump a lot of data

into those repositories and

you try to compute

all of it or even a significant portion of it, at scale and repeatedly,

it's going to cost a lot of money.

We are now clearly in a world where it's not okay to spend a lot of money, except if you have a very clear ROI for it. The

CFO

of each customer is now breathing down the neck of the data teams and data engineering teams, so a different paradigm.

So

I'm seeing,

an evolution of the conversation

towards,

do we need all this data?

If we do need all this data, what is it for?

And if we have a clear business objective for all of this data, then are there

cheaper, faster, easier ways of, processing it? At the high end, I'm seeing,

the beginnings of,

new architectures

coming up.

So when I say new, look, it's stuff that has been percolating for a long period of time that seems to be accelerating

and

a

kind of architecture I'm seeing discussed a lot these days is, okay, well, let's do

S3 for storage, because

that's not very expensive.

And then, to add a little bit of a structure, let's add stuff like, you know, Iceberg or Hudi.

And then for the bit of

OLAP, you know, DuckDB is sort of emerging kind of out of nowhere as, like, everybody's favorite solution to at least talk about.

And then, for, you know, query,

Trino.

So, you know,

possibly different tools, not necessarily those ones, but that's,

emerging as,

something that's a little different from your sort of centralized,

let's-dump-all-the-data-in-Snowflake kind of architecture. So I'm seeing that at the

upper end of the market, and the reason for that is that you need smart data engineers to be able to figure that out and

connect all those things, stitch all those solutions together,

and figure out how that works.

At the lower end of the market, which, by the way, is sort of, you know, 90% of of companies,

I'm seeing

the acceleration, it seems, of

the fully managed

data platforms, which,

you know, felt like arguably

a weird idea 2 or 3 years ago

and,

now seems

pretty interesting and logical. So, you know, I don't know how big a market category that is, but I'm certainly seeing and hearing a lot. So by that, I mean the, you know, Y42s and Mozart Datas of the world or

Keboola,

which is, you know, a different approach, but, like, with the same end result.

The former, that's really this idea of, like, abstracting away the modern data stack and using all the usual suspect vendors,

but sort of, like, stitching them together and offering the customer just 1 contract

and, just 1

relationship.

And,

you know, ultimately, I assume with the general goal of being able to then turn around to the underlying vendors and negotiate better prices.

So I think that's interesting, and I'm seeing,

you know, people

finding that interesting in a context where people want more simplicity,

more convergence,

and,

just don't have as much of a budget to hire a bunch of data engineers to do all the stitching together.

As the

landscape has evolved, as the overall interest in this space has evolved, as you have gained a greater appreciation

for the kind of nuance and detail that's necessary to be able to differentiate between

some of these different vendors, the different product categories,

You know, why do we even care about this thing?

How has that affected the way that you think about the presentation

of that information in your collection for the MAD landscape and the ways that you think about communicating

around those vendors and product categories and tools?

I've tried to strike the right balance

between

evolving, but also keeping

the broad architecture of the landscape

consistent over the years to make it easier for people to

sort of compare.

And actually, it's fascinating. There's, like, a whole group of, you know, people in the community that take those landscapes and compare the images and all the things. So I've, on the whole, tried to keep it reasonably consistent.

In the process of building this landscape, investing in this space,

working with some of these companies to help them understand kind of what are the competitive opportunities,

what are the ways to think about positioning your tools and products, what are the problems that need to be solved.

I'm wondering if there are any particularly interesting or innovative or unexpected

entrants into the market that you have seen, whether as far as the way that they're attacking a particular problem space or the ways that they're thinking about

trying to make themselves

valuable to the broader ecosystem or indispensable in a particular

way? I don't know if it's necessarily innovative

or different, but what I've seen again and again work is this general approach of starting as a tool

and sometimes arguably even a toy

and

evolving bit by bit into

a product

and then over time a platform.

And what I've seen conversely,

you know, often,

is companies that try to be more of a platform

early on, and

that's challenging.

So that's 1 thing, this,

you know, starting

with a game plan in mind with something that may feel like

a little thing,

but in the future, like, over time, evolves

as you get product market fit

around that product. And that's true in general, but that's certainly been true in data infrastructure.

The second, thing that comes to mind

is

there is a,

series of companies,

that have actually

won their respective

markets,

by doing something that's counterintuitive,

for a lot of deeply technical

founders,

which is

building

for the many

and focusing

on

democratization

and collaboration

as opposed to

trying to build,

the most bells and whistles

for the more technical users.

Because as it turns out, pretty often,

the number

of,

very technical

users that will appreciate the fine nuances

of all the features that you build for them can turn out to be pretty small in an enterprise. So, you know, the perfect example is, like, this whole generation of platforms that were built purely for data scientists.

That's,

you know, great for initial

product market fit.

But as it turns out,

to this day, it's very hard to hire data scientists. There's just not that many of them around.

And then,

they, to this day, don't always have the biggest budget

versus

an approach of saying, hey, you know, data in the enterprise,

it's not about the most technical users. It's actually a combination of tools and processes

and humans. And

it takes a village and different people are going to be involved. Therefore, we're going to build tools that are approachable

by many folks. And, you know, sometimes it's a combination of

being very technical,

but also no code.

And then, we're going to empower people around the organization.

And the 2 things I just mentioned, this evolution from tool to platform and

collaboration and democratization,

those can be very related. You can start with a very technical tool and evolve towards a broadly democratized platform. But those are 2 of the strategies that I've seen succeed over the years.

And in your own experience

of working in this space and trying to kind of gain perspective

and understanding

of the problems that are being addressed and how to

solve them in useful and economical ways.

What are some of the most interesting or unexpected or challenging lessons that you've learned personally?

In the world of VC over the last few years, for people that focus on data infrastructure,

the typical heuristic has been: hey, let's find founders who have built

a platform

within

big tech company x, y, or z. So Uber,

LinkedIn,

Lyft,

you know, and several others being the the usual suspects.

Let's take that product,

open source it if it's not already,

and turn it into

a brand new company that we VCs will fund.

What I've learned over the years is that that approach

works, but doesn't work as systematically

as the rapid funding around that model would

lead 1 to believe.

I think it turns out that,

a lot of the

problems that you experience

at an Airbnb

or LinkedIn or any of those companies

are years ahead of,

where the market is.

And,

yes, you'll have the most

thoughtful approach,

around the most,

you know, vexing problems of getting to massive scale. But in terms of where the where the bulk of the market is,

you're going to be just too disconnected.

So that's that's 1 lesson,

I think. So I'm not saying,

those deals are not great or those companies are not great. I'm just saying it's much less

systematically

successful a heuristic than 1 would believe. So that's 1 lesson.

Another lesson is that,

you

can

be right

on the analysis, but you can be wrong on the outcome.

Meaning that,

time and again, it's very hard to predict where the market is going to be, and you can have the smartest founders

building the most interesting product.

Ultimately, this is an exercise in company creation,

not in building

the best product even though you hope that 1 will follow the other.

And that's why you

sort of end up back on the,

you know, the most obvious cliche of venture capital, which is to invest in the best people you can find. As it turns out, the best people

is not,

necessarily just the most technically

strong,

people you can find, but people who are truly

starting companies because they want to

focus on making customers happy and truly enjoy

this interaction with customers. And I think

the VC market being as frothy as it was for a couple of years has led to

the creation of a bunch of companies,

that, people started because they could and because the technology was really great,

but ultimately,

you know, in in part encouraged by the whole, like, PLG bottoms up

marketing,

kind of way of going to market where you don't really need to talk to customers. You don't really need to get on that awkward

sales call, and you hope that people just, like, show up magically through your self-service

and all the things. I think that whole combination of

the frothy market plus PLG motion has led to the creation of

a bunch of companies where people don't truly enjoy

working hand in hand with customers.

So

you can be

wrong on the analysis,

and right on the outcome or vice versa.

Therefore, pick the best people.

Therefore,

lesson learned around the best people, not necessarily the most technically astute folks, but folks who are both technically very strong

and truly, in their heart of hearts, enjoy working hand in hand with customers solving business problems.

And as you continue to work in this space and invest in these categories,

what do you have planned for the future of the MAD landscape? Either ways that you want to think about

updating its presentation or content or ways to make it a long term sustainable activity, either for yourself or putting it into the hands of kind of the broader ecosystem?

Just kind of wondering what you have planned as you look forward.

Yes.

I don't even really wanna think about the MAD 2024,

considering

I'm just

exiting the MAD 2023

period, which was an effort and a half.

But look,

as stated above, I very much want this to be a conversation.

We are already

going to make a second version of the MAD 2023 landscape

based on all the feedback we got. So we created an email address

for comments, thoughts, and suggestions, which is mad2023@firstmarkcap.com.

We got hundreds of emails. We've been parsing through those.

We're going to create, you know, a second version, as I said, of the landscape,

trying to capture

most of those comments, probably not all of them, but most of those comments. So that's the immediate future.

In terms of, 2024, yes, I do like the idea of, open sourcing this even more to the community.

But, again, going back to the first principles

of why I'm doing this, I'm doing this for the forcing function of,

me doing the work. And then, you know,

through conversations like this, get the chance to complain about it. But, you know, I'm French, so complaining is 1 of the things I do best.

So if I completely open source it to the community or had other people do it, then that would sort of defeat the purpose of,

really

doing the work. So I'm, probably mostly going to continue

as is for the foreseeable future. Although, again, I'm I'm open to all sorts of thoughts and and suggestions.

Are there any other aspects of your experience

investing in and engaging with the data infrastructure landscape and your work on the MAD landscape

for those purposes that we didn't discuss yet that you'd like to cover before we close out the show?

No. I thought that was pretty

thorough. I would

like to end on a message of hope,

I guess.

The market is what it is.

I do think that,

data infrastructure

in general

is,

the gift that keeps on giving.

I think we keep going from phase to phase

from,

you know, the world of the old

MPP databases

to Hadoop to the modern data stack

to

whatever it is that we're doing today.

There's always something new. It's always a very fertile area to start

companies. I think, you wanna be very

careful

in this market when starting something. You wanna make sure that you are in it for the right reasons. You wanna make sure

that you truly enjoy working with customers on a daily basis and not just

building product. Having said all of that,

another VC cliche, but that's very true.

This kind of market is a wonderful time to start a company.

Talent is available. There's much less noise.

You have more time to iterate. You have much less risk that

5 competitors are going to emerge overnight.

So it's a great time to start

a company.

And, you know, while everybody's

busy trying to build the next thin layer on top of ChatGPT, that gives you time to be thoughtful.

It's always a great area to build a company.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

I think there's a really interesting

set of opportunities around,

generative AI for data, a little bit as a perfect end to this whole

conversation, that I find very interesting.

I have seen a bunch of companies sort of jump into this opportunity.

But, going a little bit back to the earlier conversation around democratization

and making data infrastructure

available to a broad set of people within the enterprise,

I think,

the opportunity

for people beyond data analysts to interact with data

analytics in particular, but also,

potentially, broader data infrastructure

through English or through natural language

is really intriguing.

I don't know what it means yet.

I don't think that it,

puts the jobs of data analysts at risk,

just yet.

But I think it could be a very major unlock for the space.

To this day, we're still very much in this situation where if you think of the most basic output

of this whole data infrastructure that we've been talking about,

which is a a BI dashboard,

We're still very much in this world where,

it's the province of a handful of people in the enterprise. And, as we all know, if,

you are the

CEO

or senior ranking member of

an organization

and you want some kind of,

BI analysis beyond the dashboard that's available to everyone,

sure. The Tableau analyst or the Looker person in your organization will say, you know, right away, let me get on that, and you get the result within a few hours. For anybody else, which is really 95% of the organization,

you know,

take a number and wait your turn. And that just doesn't feel

like

the best way of justifying

all this investment in

the tools. So, look, the dream of self-service

analytics

has been around, you know,

forever.

But I think,

you know, generative AI as an interface to all of this gives really new life

to the idea. And I think that's a really interesting and fertile area.

Alright. Well, thank you very much

for taking the time today to join me and share your experiences

of compiling this landscape. Thank you for all of the work that you've put into making it a reality and going to the effort of presenting it to the broader community and ecosystem. So I appreciate,

all of the time and effort that you and your team have put into that, and I hope you enjoy the rest of your day. Thanks, Tobias, for having me. Love the show.

Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning podcast,

which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and just tell your friends and coworkers.
