Summary
Every data project, whether it’s analytics, machine learning, or AI, starts with the work of data cleaning. This is a critical step and benefits from being accessible to the domain experts. Trifacta is a platform for managing your data engineering workflow to make curating, cleaning, and preparing your information more approachable for everyone in the business. In this episode CEO Adam Wilson shares the story behind the business, discusses the myriad ways that data wrangling is performed across the business, and how the platform is architected to adapt to the ever-changing landscape of data management tools. This is a great conversation about how deliberate user experience and platform design can make a drastic difference in the amount of value that a business can provide to their customers.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Adam Wilson about Trifacta, a platform for modern data workers to assess quality, transform, and automate data pipelines
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Trifacta is and the story behind it?
- Across your site and material you focus on using the term "data wrangling". What is your personal definition of that term, and in what ways do you differentiate from ETL/ELT?
- How does the deliberate use of that terminology influence the way that you think about the design and features of the Trifacta platform?
- What is Trifacta’s role in the overall data platform/data lifecycle for an organization?
- What are some examples of tools that Trifacta might replace?
- What tools or systems does Trifacta integrate with?
- Who are the target end-users of the Trifacta platform and how do those personas direct the design and functionality?
- Can you describe how Trifacta is architected?
- How have the goals and design of the system changed or evolved since you first began working on it?
- Can you talk through the workflow and lifecycle of data as it traverses your platform, and the user interactions that drive it?
- How can data engineers share and encourage proper patterns for working with data assets with end-users across the organization?
- What are the limits of scale for volume and complexity of data assets that users are able to manage through Trifacta’s visual tools?
- What are some strategies that you and your customers have found useful for pre-processing the information that enters your platform to increase the accessibility for end-users to self-serve?
- What are the most interesting, innovative, or unexpected ways that you have seen Trifacta used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Trifacta?
- When is Trifacta the wrong choice?
- What do you have planned for the future of Trifacta?
Contact Info
- @a_adam_wilson on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Trifacta
- Informatica
- UC Berkeley
- Stanford University
- Citadel
- Stanford Data Wrangler
- DBT
- Pig
- Databricks
- Sqoop
- Flume
- SPSS
- Tableau
- SDLC == Software Development Life-Cycle
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
[00:00:18] Unknown:
Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there's a book that captures the foundational lessons and principles that underlie everything that you hear about here. I'm happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O'Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy.
Your host is Tobias Macey. And today, I'm interviewing Adam Wilson about Trifacta, a platform for modern data workers to assess quality, transform, and automate data pipelines. So, Adam, can you start by introducing yourself? Sure. Thanks. It's great to be here. Thanks for having me. My name is Adam Wilson. I'm the CEO of Trifacta.
[00:02:40] Unknown:
And I've spent my entire career in data integration, data transformation, data cleansing. Most notably before taking on the CEO job at Trifacta, I had a variety of executive management roles at Informatica over a 13 year career. And I've been at Trifacta for the last 7 years as CEO, being part of what's new and what's next in the category. And do you remember how you first got involved in data management? Yeah. Interestingly, I had cofounded a company called Zimba. This was back during the dotcom boom. And the focus of that company was mobilizing dashboards. So you kind of have to wind the tape back a little bit and think about mobile in the early days of WAP phones, Windows CE devices, Palm Pilots, and there were line of business managers and executives that were trying to deliver analytic information to different form factors so they could get their dashboards and their analysis on the go.
So that was the idea behind Zimba. And ultimately, that company was acquired in August of 2000 by Informatica. And at the time, Informatica was looking to vertically integrate the BI and the data warehousing stack to create, you know, business intelligence in a box. So you could go all the way from all the traditional ETL all the way through to prepackaged analytics that would then be delivered out to web, wireless, and even at that time, voice recognition interfaces. So it was a brave new world back then. Subsequently ended up running management roles as the company grew and formed business units through a number of acquisitions that were done over the years. And then ultimately, you know, when Informatica was taken private, a number of us who'd really taken the company from IPO to about $1,000,000,000 in revenue decided it was a great time to jump out and take leadership roles in the next generation of what was gonna be happening in ETL, data integration, data quality, and that's what led me to Trifacta.
[00:04:45] Unknown:
Yeah. I definitely remember the days of mobile web being a distinct thing from the regular web. And I was definitely glad to see that has gone away. Nobody's having to write cHTML anymore and things like that. Yeah. I think I still have some book somewhere, either in my bookshelf or ebooks that are like, how to build for the mobile web. And I'm never gonna read that.
[00:05:07] Unknown:
That's right. Yeah.
[00:05:09] Unknown:
Awesome. And so as you mentioned, you are at Trifacta now, and I'm wondering if you can just give a bit of an overview about what it is that you're building there and some of the story behind the company? So Trifacta actually started as a joint research project between Berkeley and Stanford.
[00:05:24] Unknown:
So you had a world renowned database professor at Cal named Joe Hellerstein, who got together with a world renowned visualization professor at Stanford named Jeff Heer. The 2 of them kind of came to the conclusion that a lot of the action, a lot of the pain, the cost, the time in any project was actually not in the compute or in the vis, but rather was in how do you take raw data and refine it. And as they studied that problem more carefully, they bumped into a grad student named Sean Kandel, who had started his career as a quant at Citadel in Chicago. And he had decided to go back to grad school because he was spending time, you know, working on the algorithmic trading, but really was spending most of his time, you know, doing data preparation work.
And he said, this is crazy. You've got all these high powered data scientists who are all sitting around and rather than doing algorithmic work, they're all cleansing and standardizing, shaping, and structuring data. And he said, you know, this would be kind of an interesting problem to go back into, you know, sort of study more carefully. So when he went to grad school, he joined up with these 2 professors who were doing research in this area. And at the time, you know, their hypothesis was that the reason this is still so hard and so complicated and so expensive is because the people who know the data best are often not in a position to do the work. That, you know, unfortunately, a lot of the work around data engineering has become the exclusive purview of the highly technical.
And while those people may understand how to do structured programming and they may understand, you know, the details of how, you know, data warehouses work, they often will lack context in terms of, you know, how does the data actually get used to make business decisions down the line? So they said, you know, maybe it would be possible for us to democratize this a bit, to create, you know, essentially a solution that would learn from the data and would learn from how the user interacts with the data in order to make intelligent recommendations and suggestions and in order to, you know, clean up or automate some of the more complicated things that people typically bump into. They created a prototype for this that really made it much more of a user experience problem. And then for 6 months, they had 50,000 people using it. And that prototype was called Stanford Data Wrangler.
And they actually picked Data Wrangler as the name because that was the vernacular of the user. The user didn't think of their problem as an ETL problem. That was a more technical term. They really thought of their problem as sort of a problem of wrangling or munging or stitching data together. They launched this solution, and it was wildly popular. And that's when our good friends at Accel Partners and Greylock and Ignition and a number of other tier 1 VCs came knocking and said, hey, guys. We think there's, like, a real company here, not just an academic prototype. So why don't you take sabbaticals and jump in and do this thing for real? So that was almost 9 years ago that this was pulled off campus. And, you know, I'm sure we'll get into it as we go. But if you then sort of fast forward to where we are today, you know, over 10,000 companies using the product in some way, shape, or form, and significant, you know, customers out there like the Bank of Americas, the GlaxoSmithKlines, the Pepsis, as well as small, you know, startups and consultants and pretty much anything under the sun in terms of types of users that are creating next generation data products using this no code, low code approach to data engineering.
[00:08:54] Unknown:
And I think it's interesting that you very deliberately use the term data wrangling pretty much everywhere across the site and the different materials that I've seen versus the, you know, ETL that data engineers will be familiar with. And you kind of touched on the reasoning behind that, but I'm wondering if you can just dig a bit more into how that difference in terminology influences the way that you think about the problem that you're solving and the way that you design the product for, you know, this data wrangling versus the more formalized ELT approach.
[00:09:27] Unknown:
And you make a great point there because that is something that we talk a lot about, and there's been an evolution, I think, in our thinking over time. So in the beginning, it was very much, you know, eyes on data all the time. So 1 of the fundamental principles was that, you know, rather than writing, you know, specifications and handing it to somebody, and then hoping that some number of months or quarters later, they would generate a data warehouse and you'd finally get to look at the data. And then you'd be like, Now that I see the data, it's not really what I wanted. Or now that I see the data, my questions have changed. The idea was that you would essentially interactively build a specification that was essentially, you know, able to be put into production when you were done. So the idea was to do in clicks what used to take, you know, months under the old model.
The hope was that you were going to allow data analysts who, again, have a lot of context in their heads, to more fully participate in this process; that, you know, was a lot of the original approach. And the original approach, I would say, was more of a no code approach than even a no code, low code approach, which is what, you know, people talk about a lot these days. But we've seen that evolution ourselves in that, you know, as these projects spun up, the surprise for us was that as much as we wanted to democratize the ELT or the ETL work, we had to recognize that providing affordances for the traditional data engineers that would allow them to come in and to write code where necessary was gonna be important. And a lot of that had to do with the fact that as you started thinking about operationalizing at scale, as you started thinking about governance, as you started thinking about how the scaffolding was gonna be put in place to do this in a way that would be, you know, applicable across an entire enterprise, pretty quickly, you had to make sure that self-service wasn't going to turn into chaos. And it wasn't just gonna be, well, let's let end users kinda go and do whatever they want. It really became more about this collaborative curation that was gonna happen between the more technical users and the users that were very data driven and data savvy, but not necessarily, again, structured programmers.
And so over the last, you know, roughly 9 years, we've evolved the platform quite a bit to embrace both sides of that equation and to say that, you know, we do need to make sure that we're thinking as much about the governance and operationalization. We're thinking as much about some of the complex use cases where writing code is gonna be required and to provide affordances for that, embrace and welcome that, you know, into the process while at the same time, staying pretty committed to this idea that we do want a very visual experience, you know, where people can go in and do this work very quickly without necessarily, you know, having to be a, you know, full on, you know, computer scientist. If they know a little bit of SQL, great. You know, if they know a little bit of Python, fantastic. That's a bonus. But you don't have to be somebody who, you know, is a more traditionally, you know, trained developer in order to get productive with data. And we felt like that was really critical because in the end, for the last decade, everybody's being told be more data driven. And it's like, well, great. Well, then give me the damn data in a form that I can use it and I'm happy to be, but don't tell me to be more data driven. And then tell me it's gonna take 6 months, you know, to generate a data warehouse for me to actually start to make use of this information.
So that was really the fundamental bottleneck that we were trying to, you know, blow up through this process of the data engineers now finally coming together to work with, you know, some of the analytic end users or analysts.
[00:13:07] Unknown:
In the past 9 years, there's been a pretty huge shift in the overall landscape of data tools and the roles and responsibilities that exist across them. You know, 1 of the most notable in terms of the role shift is the idea of the analytics engineer that has come about because of tools like DBT. And then also on the platform side, things like cloud data warehouses have drastically accelerated the ability for people to spin up new, you know, data infrastructure. And I'm wondering how that has influenced the ways that you have implemented the Trifacta product and just any sort of new capabilities that you've been able to bring about because of this shift in the overall landscape?
[00:13:52] Unknown:
There's been no more kind of fun time probably in the history of the world as it relates to data than right now. I mean, there's, like, a Cambrian explosion of new technologies, new platforms, you know, new databases, new architectures, new approaches. And for us, that's actually quite exciting because, you know, in many respects, Trifacta is a decision you make that guarantees your other decisions are changeable. And what I mean by that is, you know, we're not married to a specific storage architecture. We're not married to a specific execution environment or cloud platform.
You know, this is, you know, a polyglot approach that allows you to compile down to whatever makes the most sense. You know, you can take your recipes that you create as you're doing your data wrangling and compile down to whatever makes sense for the workload or the use case that you're going after. And so I think as these new trends evolve and emerge, you know, it gives us a chance to grab onto them as they start to mainstream. So you mentioned dbt. So, you know, we've been doing some work with Tristan and his team and, you know, our, you know, very low code, no code approach is very complementary to what they're doing that's a bit more at the command line or, sort of, more traditional kind of SQL programming.
And, you know, we get an opportunity to say, hey, you know, if you like dbt, so do we. And, you know, we can actually take the SQL that we compile down to, which is 1 of the executions that we support, and we could actually generate dbt specific SQL. And we could actually help you profile what you're doing in and around dbt, and we could actually help you orchestrate some of what you're doing in and around dbt, but to do it in, again, a very visual way. And that is complementary to what we've historically done around things like Spark and things around Dataflow with Google. And even more recently, some things we've released where you can compile down to Python and, you know, use Jupyter Notebooks with Trifacta. And so this whole idea that, you know, as these trends come on board, I think, you know, for a lot of our large enterprise customers, they just want a way to embrace those trends without feeling like they've got to completely refactor, you know, everything that they've, you know, historically done. They wanna grab on to those benefits too, but they wanna do it in a way that's reasonably seamless and reasonably graceful.
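For readers who want a concrete picture of what "compiling a recipe down to SQL" can mean, here is a minimal, hypothetical Python sketch. It is not Trifacta's actual recipe format or code generator; the step names and helper function are invented purely to illustrate the general idea of turning declarative wrangling steps into a SQL statement that could then be saved as a dbt model file.

```python
# Minimal sketch (not Trifacta's actual recipe format or compiler): compile a
# declarative list of wrangling steps down to a SQL statement. All step names
# and fields here are hypothetical, for illustration only.
RECIPE = [
    {"op": "drop_nulls", "column": "customer_id"},
    {"op": "trim", "column": "email"},
    {"op": "rename", "column": "dob", "to": "date_of_birth"},
]

def compile_to_sql(recipe, source_table):
    # Start from a straight pass-through of the known columns, then let each
    # recipe step rewrite its column expression or add a filter.
    select_exprs = {"customer_id": "customer_id", "email": "email", "dob": "dob"}
    where_clauses = []
    for step in recipe:
        if step["op"] == "drop_nulls":
            where_clauses.append(f'{step["column"]} IS NOT NULL')
        elif step["op"] == "trim":
            col = step["column"]
            select_exprs[col] = f"TRIM({col}) AS {col}"
        elif step["op"] == "rename":
            col = step["column"]
            select_exprs[col] = f'{col} AS {step["to"]}'
    select_sql = ",\n       ".join(select_exprs.values())
    where_sql = " AND ".join(where_clauses) or "TRUE"
    return f"SELECT {select_sql}\nFROM {source_table}\nWHERE {where_sql}"

if __name__ == "__main__":
    # Writing this output to something like models/clean_customers.sql is how
    # a dbt project could pick it up; the path is illustrative.
    print(compile_to_sql(RECIPE, "raw.customers"))
```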
And, you know, the metadata driven approach that we take, you know, helps with that. And I think kind of increases our value proposition as, you know, more and more of these technology trends, you know, come on board. And as more and more customers realize that there isn't sort of 1 size fits all, that there's reasons why they're using data lake architectures. There's reasons why they're using cloud data warehouse architectures. There's reasons why, you know, they wanna do, you know, some things in memory. There's reasons why they wanna use different languages. Then sometimes they want just sort of different levels of control over the manipulation of the data. So all of those things, I think, create really interesting moments, you know, for us to jump in and, again, embrace those trends as they mainstream and provide a flexibility that, you know, allows them to move across them without having to, you know, completely rethink everything from the ground up each time a trend comes along. And as I talked to chief data officers and chief analytics officers and, you know, others, you know, they know that, you know, I always ask them. I say, you know, are the decisions you're making today the same as the ones you made 5 years ago? And the answer is no, usually. And so I'm like, okay. So it stands to reason that the decisions you're gonna be making around data and analytics 5 years from now will probably be different again. And if that's the case, then you wanna really think seriously about, you know, future proofing wherever you can
[00:17:33] Unknown:
so that you're always, you know, able to accelerate the work that you're doing. In terms of the place that Trifacta sits in the overall landscape of data, I'm wondering if you can give some context to that and some of the tools or systems that it might replace if you already have an existing data platform.
[00:17:50] Unknown:
I like to describe what Trifacta does a little bit as the decoder ring for data in the sense that we're not the underlying cloud data warehouse. We're not the BI tool. We're not the algorithmic, you know, data science workbench. You know, we're not the storage layer. You know, we really do, you know, kind of service this need of how do I, you know, get data, you know, into the right place? How do I cleanse, transform, and standardize it in a way that, you know, kind of takes raw data and refines it to make it useful? And how do I make sure that that data then can be continually monitored and updated and delivered to the right places. So the what we do is very much ETL, or I guess more accurately ELT.
This sort of how we do it and therefore sort of the who we do it for is what has been quite different. So for us, you know, it's, we profile data, we cleanse data, we transform data, we blend data, you know, we do all of the things that people kind of know and love about, you know, traditional ETL tools or, in our case, ELT approaches. But everything for us is very much a very visual, very no code, low code experience where we apply a lot of the AI and the ML to the data wrangling process, to infer things, to predictably suggest transformations.
The nice part is, again, every dataset is not a new dataset. Odds are somebody has seen something like it before. You know, as we now have, you know, billions of examples of people transforming different kinds of data, we can curate that corpus of usage data at scale, and we can start to apply algorithms to say, hey, when people faced data that sort of was in this shape and size and format and had these kinds of problems, like, here are the kinds of things that they typically needed to do to the data. So we can provide those recommendations and suggestions. We can point out some of those potential issues and anomalies. We can help them remediate those challenges.
And it really, a lot of it was born of this idea that like, you know, just take a data quality example. Like what if your dataset could actually suggest the quality problems that it potentially has to you rather than you upfront guessing at those or trying to anticipate what those are. And so there are a lot of things you can do algorithmically now at scale that allow those kinds of things to be done much more predictably or done in a way that's much more automated. And so I think we get really excited about kind of taking that more traditional kind of more legacy kind of ETL approach and trying to say like, okay, now on a modern stack, in a modern way, with all of the compute and algorithms and things we have available to us, as well as with all the advancements in user experience, like how can we come at this anew and really think about a guide and decide approach that will provide a more interactive experience for users and a more collaborative experience for technical and less technical people to come in and to do a lot of this work together. Kind of building on the original concepts that were, you know, the historical, you know, kind of ETL concepts.
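As a rough illustration of the "let the dataset suggest its own quality problems" idea described above, the following Python sketch profiles a pandas DataFrame and emits candidate cleanup suggestions. It is a toy heuristic with made-up thresholds, not Trifacta's ML-driven inference engine.

```python
# Toy sketch of the "let the data suggest its own quality problems" idea:
# profile a DataFrame and emit candidate cleanup suggestions. Not Trifacta's
# actual inference; thresholds and checks are assumptions for the example.
import pandas as pd

def suggest_quality_fixes(df: pd.DataFrame, null_threshold: float = 0.05):
    suggestions = []
    for col in df.columns:
        null_rate = df[col].isna().mean()
        if null_rate > null_threshold:
            suggestions.append(f"{col}: {null_rate:.0%} missing; consider imputing or filtering")
        if df[col].dtype == object:
            # Mixed Python types hiding in an object column often signal parsing problems.
            if df[col].dropna().map(type).nunique() > 1:
                suggestions.append(f"{col}: mixed value types; consider standardizing")
            # Stray whitespace is a common standardization issue.
            values = df[col].dropna().astype(str)
            if values.str.strip().ne(values).any():
                suggestions.append(f"{col}: leading/trailing whitespace detected; consider trimming")
    return suggestions

if __name__ == "__main__":
    sample = pd.DataFrame({"email": [" a@x.com", "b@y.com", None], "age": [34, "34", 29]})
    for s in suggest_quality_fixes(sample):
        print(s)
```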
And so I think, you know, that's sort of the part of the stack that we inhabit and, you know, we can take it all the way from, you know, connecting the systems and bringing data in to then doing all of the profiling, the cleansing, the transformation on the data, as well as then orchestrating the data flows and monitoring those data flows so that you can go from prototype to production all within the same environment and do it in a way that is very seamless and ultimately very user friendly. And we routinely will have customers that will, you know, go from signing up for our trial to, you know, getting to those data products they're creating or authoring literally, you know, sometimes in 15 or 20 minutes; they've built stuff they're ready to kinda use in production.
[00:21:49] Unknown:
And in terms of the architectural aspects that allow you to provide that seamless experience, I'm wondering if you can describe the overall design and user experience of the system and how it's able to provide those experiences.
[00:22:04] Unknown:
Yeah. Yeah. So the first thing is it's a completely, you know, SaaS based solution, multi tenanted, you know, with all of the benefits that you get from the elasticity, the high availability, etcetera, that 1 would expect from a traditional SaaS offering. And that allows us to do a lot of interesting things in terms of getting innovation to market very rapidly. So we're shipping product, you know, every couple of weeks. It's a very dynamic and exciting environment. And a lot of that, you know, came to us and a lot of this sort of early architectural thinking around SaaS came to us because of work that we did in the early days with Google. So, you know, Google came to us about 4 years ago and said, guys, you know, we don't, you know, have a proper data quality, data transformation solution for GCP.
We have a lot of customers that are BigQuery customers and Google Dataflow customers that, you know, are complaining that, you know, if their data quality is bad, then, you know, their analytics are probably worthless. So we've got to solve this and, you know, we've actually never, you know, resold or OEM'd a third party technology on GCP before; they had the tendency to like to build their own stuff. But in this case, you know, I think what Google has historically been great at are a lot of things that are engines and APIs. And this was again, a very visual experience and they just felt that, you know, Trifacta had done something exceptional here. And so they, you know, worked with us over the years to really build a very optimized service that would be able to handle, you know, GCP scale and complexity.
And then we subsequently stamped that out on Amazon and then eventually on Azure as well, so that we could provide a multi cloud story for customers that needed that to service different parts of the market. That for us was a way to make sure that what we were doing was going to, you know, pass muster with, you know, some of the largest organizations on the planet when it comes to, you know, SaaS and cloud deployment. And that's everything from, you know, security and, you know, compliance and all the things that you have to pay attention to, you know, in the SaaS world that are very specialized. But I think 1 of the things that also came out of that is with Google, you're dealing with, this is not a startup that gets to tiptoe into scale. Like, you're gonna deal with Google scale immediately, and you're gonna deal with global deployment immediately.
In some sense, we got baptized by fire a little bit here, you know, in the early days. And I think dealing with some of the challenges associated with that really hardened the product very early on. And so 1 of the things that came out of all this was something that we described as a sample to scale architecture. So you can imagine if you're trying to do all this in a browser, in memory, you know, then you better be really good at reaching into massive data lakes or into, you know, massive BigQuery environments. And, you know, if you try to pull, you know, many, many gigs, you know, of that data back into a browser, good luck. Right? There's no way that's gonna be an eyes on data visual experience. Like, you're probably gonna end up with a situation where performance becomes so painful or the browser just crashes and you can't handle the data that you're trying to deal with. And so pretty quickly we said we're gonna need to build an in memory technology based on WebAssembly that is going to allow a much larger sample. So we're gonna have a sample for larger datasets. If we hit a certain threshold, we'll move to a sampling based approach.
And we'll be able to handle much larger samples interactively in the browser by essentially leveraging this modern browser technology in order to grab those much larger samples. But also, we're gonna have to pair that with a much more sophisticated sampling approach because it's not simply a matter of saying, well, let's just do a top of head sample or a random sample to get cracking. Like, you're gonna need to support anomaly based samples, stratified samples. I think we have something like 8 or 9 different sampling techniques that we support out of the box. And if customers have their own sampling techniques they wanna use, they can essentially bring those and extend what we do. But this whole idea of, I wanna do it interactively in a browser. I want to have eyes on data. I don't want to persist data on the desktop, but I want to make sure that I'm able to understand the nature and the complexity of the data that I'm dealing with, knowing that the data volumes could be, you know, measured in terabytes or even petabytes in some cases, especially if it's machine generated data.
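A minimal sketch of the sampling idea discussed above, contrasting a plain random sample with a stratified sample that guarantees rare categories still show up in a small interactive sample. The function names and pandas-based approach are illustrative assumptions, not Trifacta's actual samplers (which also include anomaly-based and other techniques).

```python
# Toy sketch contrasting two sampling strategies for interactive profiling of a
# large dataset; Trifacta's real sample-to-scale implementation is not shown here.
import pandas as pd

def random_sample(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    # Simple uniform random sample; rare categories may be under-represented.
    return df.sample(n=min(n, len(df)), random_state=seed)

def stratified_sample(df: pd.DataFrame, by: str, n_per_group: int, seed: int = 0) -> pd.DataFrame:
    # Take up to n_per_group rows from every group so rare categories are
    # guaranteed to appear in the sample the user sees in the browser.
    return (df.groupby(by, group_keys=False)
              .apply(lambda g: g.sample(n=min(n_per_group, len(g)), random_state=seed)))

if __name__ == "__main__":
    df = pd.DataFrame({"region": ["US"] * 9000 + ["APAC"] * 900 + ["LATAM"] * 100,
                       "amount": range(10000)})
    print(random_sample(df, 500)["region"].value_counts())
    print(stratified_sample(df, "region", 200)["region"].value_counts())
```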
And so that really informed our thinking from the very beginning as to, you know, how to architect this solution, you know, while still making it very visual and very interactive and making sure that it was not a proprietary, you know, store that we were moving things into. We were not copying data everywhere. We were very sensitive to not just the egress cost, but the security implications of that and wanted to make sure that, you know, we could do that in a way that would perform and also be highly intuitive to the end user. And so that really dictated a lot of those early architecture decisions, and those were decisions that we made, you know, in part in conjunction with Google as we rolled that out globally.
[00:27:07] Unknown:
And in terms of the evolution of the platform, I'm curious, what are some of the aspects that have not really survived from when it, you know, started in the early days to where it is now, particularly as you were dealing with that scale up with Google? And what are some of the sort of core elements that have remained consistent throughout as you dealt with, you know, this variety of scale,
[00:27:31] Unknown:
the proliferation of tools, the, you know, number of different integration points that you had to deal with? It's been interesting to see over the years. I mean, I would say a couple of things come to mind immediately. 1 is that when the company started, it was in the heyday of Hadoop. You know? So, you know, everybody said Hadoop is gonna be the analytic, you know, platform of the future. And an entire tool chain was growing up around it because the traditional SQL based and relational based tools were not optimized for that level of scale and complexity for that platform more generally. And so, you know, we kind of came into being kind of as that trend in the market was, you know, moving at full force and speed.
And so in the early days, we were generating, you know, Pig, and we were doing things from a compilation perspective on the back end that subsequently got supplanted by Spark and then, you know, subsequently then moved to SaaS and Cloud. Things like Databricks as an example. And so this again, I guess, goes to show you how quickly things do change and why this, you know, polyglot architecture, you know, from an engine perspective is so crucial because, you know, ultimately, what we were doing initially in the Hadoop world, you know, translated to kind of new languages and executions, but then also moved pretty dramatically from on prem to cloud. You know, and then subsequent to that, you know, all of a sudden you saw the renaissance of SQL. And so very much started doing more work in and around SQL. And then, you know, as we started doing more and more with data scientists who were saying like, hey, we hear our analytic engineers and our data engineers really like what Trifacta does. And now we're trying to figure out how we can spend more time on algorithms, less time munging data.
Can we use it too? But they came and they showed up at the door saying, we really want Python, you know, and Pandas and Jupyter Notebooks and things of that nature embraced more fully. And so you just sort of see these trends kinda come along. And in some cases, they replace, you know, prior approaches. In some cases, they're additive to prior approaches because you're dealing with some different use cases and personas. But I think we saw very much an evolution of, you know, the engines that we were working with and, therefore, some of the languages and the compilation as we were generating code that we were dealing with along the way. I would say probably the second area where we saw our thinking evolve was very much in and around connectivity. I think in the beginning, you know, we said, well, gee, you know, a lot of the ingestion or the data movement, the sort of E and the L of the ETL process has been, you know, broadly commoditized. You know, a lot of the open source guys are giving that away for free. The cloud guys are giving that away for free.
You know, Sqoop, Flume, Kafka, FTP, traditional ETL. However you get it there by hook or by crook, you're gonna get the data into the data lake or into the cloud. And we said, we're gonna really focus on the T. And so we were this kind of strange animal that was, you know, essentially an ETL company that didn't do E or L. We just did the T. And we did the T in a very different way, but that's really where we put all of our energy because we thought all the value was in the cleansing and the transformation and the profiling. And that's where you were gonna win the hearts and minds of the users. And so we left all the other stuff to other people. The big lesson learned though was that if the data is not in these modern platforms, then you get stuck and they never get to your goodness.
And I guess the other thing you learn is that if you're pitching self-service, if you're really trying to say like, Hey, we're going to democratize, you know, this, then, you know, sure. There's other ways they could get the data there. But in the end, they're like, if you're a self-service solution, then be fully self-service. Like, don't tell me that I need to go off and talk to 10 other people in order to get to a point where I can self serve or don't tell me I have to go use 5 or 6 other tools before I can self serve. Like the whole point is you're supposed to be taking out friction and pain. And right now, if you only do the T, then you're not helping me service enough of my end to end problem. And sometimes getting the data where I can operate on it is not just the beginning of the problem, but in some cases can be 1 of the bigger challenges. And so we've added a tremendous amount of connectivity, I think we support now over 200 adapters and there's a connectivity framework where you can bring your own adapters or roll your own if we don't support them. There's a lot that we've done to try to solve, you know, more of the data movement challenges as well to complement what we've done because we want to get to a point where people can sort of do this on their own without it turning into a project that requires lots of other dependencies throughout an organization, which inevitably slows things down and makes it difficult to get to value quickly. So those are probably 2 examples of how the technology trends have evolved, but also how our thinking has evolved,
[00:32:27] Unknown:
you know, over the last 8 or 9 years as we've gotten into more customer scenarios and as we've kind of seen what the market, you know, wants from us. For somebody who's using the Trifacta platform, can you just talk through some of the overall user experience and workflow of going from, I have this problem. I need to be able to build some sort of analysis to, I have processed the data. I've cleaned it. I have built my analysis and just the different touch points and the different roles and responsibilities that are interacting with this platform throughout that process.
[00:32:57] Unknown:
Maybe a good way to describe that is to just give you an example of 1 of our customers and how they use us. So GlaxoSmithKline is a customer that's been with us for a number of years. You know, this was pre pandemic, but even then they were very focused on how do I use data to speed the process for the approval of new drugs? Because right now from inception of a new drug to FDA approval process is like a decade. And they're like, we think that with better data and improved ability to work with that data, that we could at least cut that in half. Right. So this was pre COVID where people are trying to figure out how they can get this done in a year or 2, when the ambition was like, let's get it from 10 years down to 4 or 5 years. The kind of first best example of that for them was in the area of respiratory drugs.
So, you know, they have a big business around things like asthma, and so they sell a lot of inhalers. And, you know, when they would do clinical trials with these inhalers, you know, they'd send them out and people would use them and then they'd ask them to fill out surveys. And it's like, did you use your inhaler, you know, 3 times a day? Did you use 3 pumps each time? Did you use it at 9, noon, and 5? And, you know, they would get information back. But, you know, sometimes people didn't know they weren't using it correctly. Sometimes they didn't remember exactly how they used it. And so there's a lot of noise, you know, in the data. So at some point, they decided to put sensors into the inhalers, and they were gonna then get behavior data. So what they called real world evidence that would help them get more accurate reads of what was going on as they were doing these clinical trials. Now the challenge with that though is that now you've all of a sudden got this new sort of data type that you're dealing with, which, you know, was different than the experiment data, the assay data, the medical record data, you know, all of the other things that they were trying to stitch together to do the work. And they wanted their scientists to do it, not necessarily data scientists. These were more like chemists because they knew what the codes meant and they knew kind of what they were looking for, again, context in their heads. And so they said, well, you know, we're gonna get this to a point process to understand efficacy of new drugs, understanding, you know, things like contraindications, etcetera, and doing the analysis they need to do for the FDA approvals to get through the stage gate process. And so, you know, we became the piece of the puzzle that, you know, was, you know, helping them to take this disparate data that was coming from a multitude of different systems.
You know, we were helping them to land that data in a data lake. We were then helping their scientists, their chemists get access to that information, get eyes on that information and do a lot of the cleanup that they needed to do, you know, to file the appropriate, you know, trial findings with the FDA and to feed that back into the process of the research that they were doing. The last mile visualization was being done. You know, some of it was literally Excel spreadsheets. Some of it was in things like Tableau. Some of it was more statistical analysis that they were using SPSS and other technologies or other, you know, kind of BI reporting analytic and algorithmic environments for. And so again, from our standpoint, since we're not trying to be that end of the equation, we wanna prepare the data or refine the data so that regardless of what end user tool you wanna do for the analysis, it's always in a form that you can grab onto it that's useful. It's also governed so that if you ever have to show your work, like, how did you get that data in that format, you'd be able to do that very cleanly. Right? And I think that sort of audit trail or lineage also becomes really important, you know, to making sure that, you know, again, the self-service aspect of this thing doesn't, you know, go off the rails.
[00:36:48] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo or Facebook ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. 1 of the things that is often interesting to think about as you're working with these self-service tools where the target end user is an analyst or a domain expert is that, you know, it can sometimes lead to, you know, data sprawl, where if a data engineer were to build the system from scratch, it's going to be less accessible to the end user, but it might be more, you know, internally cohesive.
And I'm curious how you're able to balance those tensions of wanting to make something accessible to the end user but well organized and maintainable. And any interfaces or patterns or systems that you provide to the data engineering team to be able to enforce some of this internal structure and collaborate with the end users to help educate them about how best to access and process the data.
[00:38:20] Unknown:
What you just described was the lightning bolt, like, moment that I had pretty early on in my tenure at Trifacta. So, you know, I wasn't a founder. I came in as CEO after the company had been up and running for, you know, 2 and a half years. And 1 of my first presentations was to Bank of America. I was presenting Trifacta to a group of executives that had come out to California for an innovation summit. And I presented, you know, it was only a few weeks into the job, very passionate about this exciting new technology and, you know, got done. And I was like, wow, that went really well. And I felt really good about it. And 1 of the executives came up to me and was like, wow, that was a great, you know, presentation.
Super exciting stuff you guys are doing. I was feeling the momentum. And then that same executive said, but we're never gonna buy your product. I was like completely crushed. And I was just trying to think to myself, like, I was like, listen, I knew this guy, you know, from my time at Informatica and I grabbed him. I said, please come to the bar. Let me buy you a drink. This was in the afternoon. I'm like, let me buy you a drink. And in the time it takes you to have that drink, like pound on me and tell me what we got wrong. And so, basically, it was a lot of what you were describing, which is like, hey, you know, what you guys are trying to do, like, we agree that data products, you know, are going to be the frontier for competition. That it's less just about the algorithms and about the infrastructure because it's in the open source community. You can, you know, we can spin up new environments with a click of a button.
It really does now come down to like how quickly and productively can you stitch new and interesting datasets together in new and interesting ways to yield new and interesting insights. So they're like, we get that and we love the idea that we're gonna almost start to crowdsource some of these data products, you know, from this broader set of users that are kind of out in the organization. But they're like, but we are a highly regulated bank and we are hundreds of thousands of people. And if you guys don't really think through very carefully how that stuff is gonna be organized, how it's gonna be shared, how it's gonna be reused, how it's gonna be kind of managed, then we can't afford to bring you in because this will, you know, sort of die under the weight of the organization, or it will, you know, lead us into a sprawling situation that, you know, starts to get out of control if it's successful.
And so I was like, okay, well, what does this mean? Right? And they said, well, you know, you gotta think about, you know, the things that the data engineers really care about. And again, this was part of our transformation from like really thinking about this kind of collaboration rather than just the initial genesis of the company of, like, let's go focus on what the analyst wants. Then now we were starting to say, like, okay, we really gotta care about what the data engineer wants, and we gotta start thinking about the pipelines that get created. How are we gonna monitor those pipelines? And we gotta start thinking about the SDLC process and, like, how are we gonna essentially, you know, integrate with stuff like Git and allow, you know, recipes and code to be checked in and checked out and sort of managed and versioned correctly? How are we going to, you know, think about templatizing things so that, you know, if there are well worn paths for doing certain types of work that you can grab on to templates and you can use those. How do we start thinking about macros?
How do we start thinking about, you know, approval processes where if somebody is collaborating on a recipe, that there may be certain steps where you have to check with somebody, you know, as part of implementing something where a subject matter expert could sort of put their seal of approval on something before the concrete gets poured. We have to think about chain of custody of the data. Like, can you go in and see literally every single step of every recipe and every edit, you know, when it was made, who it was made by. So that if somebody ever says, like, you know, if we're doing and, you know, this is true for what we did at Bank of America, it's like, if you're in the quant group and you want to know like, okay, we built a training data set that ultimately birthed an algorithm that we then tried to trade on. At some point, someone would say, well then how did you create that training data set? Because now we're making multimillion dollar trading decisions.
Right. So it's like, you needed to like really just grapple with all of that in a pretty profound way. And it completely changed our focus and investment strategy over the last, you know, probably 5, 6 years to grab on and embrace that more fully. And so that became a very powerful lesson for us. And the epilogue to the story is that now Bank of America is 1 of our largest lifetime value customers, but it sort of went from that moment of, you know, kind of being punched in the gut a little bit and saying like, hey, this is cool. But if you haven't thought about this other half of the equation, then, you know, this is a nonstarter for us.
[00:43:11] Unknown:
And then another element of the solution that you're providing, and you touched on this a little bit with your discussion about the different sampling strategies, is the limits of scale or complexity that your end users are able to tackle using Trifacta just because of the fact that at a certain point, if you don't have a large enough sample size or because there's enough variation in the data, you're not going to be able to produce a truly effective recipe for processing and cleaning the information and just some of the strategies that you and your customers have come up with for being able to deal with some of these limits of scale, whether it's, you know, the 3 V's of velocity, volume, or variety, and being able to preprocess and clean and pre aggregate the data before exposing it to the end users in Trifacta?
[00:44:00] Unknown:
The scale question comes up a lot. And, you know, I think part of it is because the world has gotten to a point where, you know, they want, you know, a spreadsheet like experience, but they understand the limits of spreadsheets, both in terms of the data types that they can handle, as well as in terms of the volume that they can contend with. It was very interesting for us to try to, you know, think about like, again, from the very early days, like how to solve that problem in part, because, you know, data lakes were kind of our beachhead in the beginning. So we, in some respect, you know, picked a tough beachhead because it wasn't like, you know, most companies probably would have started, you know, smaller datasets, less complicated situations. Like, we immediately went in and said, we're gonna do this on some of the biggest data lakes in the world. And that of course then led us to some of the biggest companies in the world. And so then next thing you know, like you're not easing into this. Like you're having to solve these really big complicated architectural problems right up front, which is kind of daunting and, you know, pretty capital intensive as well. Like I needed to hire up a bunch of people who came with, you know, down in the dirt experience of having built scalable systems for big enterprises, you know, who could bring that mentality in while at the same time, we were trying to be this, like, disruptive young company that was applying, like, kind of new approaches to, you know, visualization on data and kind of applying ML and AI to the data to automate things. And so it was like kind of a marrying of, like, you know, people that we were hiring out of, like, LinkedIn and Facebook with, like, people that I was also bringing in from places like Informatica.
That was an interesting kind of cultural experience for us to go through. But, yeah, that, like, truly trying to go after that and to sort of think through, like, how we're going to do that, you know, became a key differentiator for us. And it's a lot of, you know, hard lessons learned as you get into these big projects and you see the scale of the volume that the data has. And I think, you know, to your question from earlier, it's like 1 of the other things we learned though, is that at some point people also wanted different, almost zoom levels on the data. And so we took very much again, this spreadsheet style metaphor, but we were doing everything behind the scenes that was more of a push down operation into these various engines.
That's where we could get the scale. And we got good at the sampling, and we got good at this sort of in memory work that we were doing in the browser and the interactivity and all of that. But then some people said, well, hey, but I still wanna be able to zoom back and almost take a more process centric view to what's going on with my data. And so then we said, oh, okay. So we created something we call the flow view, which would be probably more typical of a traditional ETL kind of mapping, you know, kind of view of the data. And then as we started doing more work with the data engineers, they're like, well, you know, this is a long tail market, right? Like there's a lot of data complexity out there.
And as much as what you're providing visually is awesome, there are always gonna be some corner cases where there's some weirdness in the data or there's something going on in the underlying systems that you need to be able to handle that may not be a first class feature in Trifacta, and we can't have your roadmap prevent us from getting those use cases done. So that's when we introduced this idea of more of a code centric view, where if you really need to drop down and write a little bit of code, you can do that because there may be things that, you know, we couldn't have anticipated or maybe we've anticipated them. We just haven't been able to get to them. And we don't want the answer to be no. We want the answer to be here's how. And that's where, you know, user defined functions and pre and post processing and parameterization and, like, all kinds of capabilities come out of that that give people, the more technical users, the more traditional data engineering audience, the ability to go in and actually, you know, kind of help contend with that scale and complexity that inevitably exists, you know, in the exotic world of, like, all this data coming together and, you know, from all these different systems and this voracious appetite for people to stitch it together in unique ways.
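To illustrate the "drop down and write a little bit of code" escape hatch described above, here is a hypothetical Python sketch in which most steps stay declarative but one step calls out to a user-defined function. The registry, step format, and function names are invented for the example; they are not Trifacta's actual UDF interface.

```python
# Toy sketch of the "escape hatch" idea: most steps stay declarative, but one
# step can call out to user-supplied Python for a corner case the visual tool
# doesn't cover. Everything here is hypothetical and for illustration only.
import pandas as pd

UDFS = {}

def udf(name):
    # Register a user-defined function under a name that recipes can refer to.
    def register(fn):
        UDFS[name] = fn
        return fn
    return register

@udf("normalize_phone")
def normalize_phone(value: str) -> str:
    # Keep only digits and return the last 10, a deliberately simple rule.
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    return digits[-10:] if len(digits) >= 10 else digits

def apply_step(df: pd.DataFrame, step: dict) -> pd.DataFrame:
    if step["op"] == "uppercase":          # a built-in, declarative step
        df[step["column"]] = df[step["column"]].str.upper()
    elif step["op"] == "udf":              # the user-defined escape hatch
        df[step["column"]] = df[step["column"]].map(UDFS[step["name"]])
    return df

if __name__ == "__main__":
    df = pd.DataFrame({"state": ["ca", "ny"], "phone": ["(415) 555-0100", "1-212-555-0199"]})
    for step in [{"op": "uppercase", "column": "state"},
                 {"op": "udf", "name": "normalize_phone", "column": "phone"}]:
        df = apply_step(df, step)
    print(df)
```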
[00:48:05] Unknown:
And as you have been working with Trifacta and growing the business and working with your customers, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:48:15] Unknown:
There was a Centers for Disease Control case that I was fascinated by. A number of years ago, you may have heard of this, there was an article, I think it was in the New York Times, about an outbreak of HIV in rural Indiana. And the CDC got involved because they were just trying to figure out what the heck was going on. It was very strange to see this outbreak, not a pandemic but a sort of epidemic, localized to this specific region. They have essentially data black belts whose job is to come in and try to understand what's going on. A lot of that means looking at geospatial data to see where things are going. You're looking at all kinds of research. You're looking at medical data coming back from the hospitals. You're looking at police blotter or police record data. You're looking at demographic data. The list goes on and on, because you're just trying to piece together a narrative to diagnose what is happening and why it is happening here of all places. These are typically very urban issues, and all of a sudden it's in the middle of this relatively small town. Eventually a picture starts to come into view from looking at all this data. And keep in mind that when people are dealing with situations that have social stigma attached to them, if you're showing up at a hospital, you may not be using your real name.
Right. There are things people do to try to hide data or to make sure data is not so easily found or interpreted because of the circumstances, which makes the challenge even harder. But they were able to use Trifacta to really start to put together a narrative on what was happening. As it turned out, by looking at all the trends and the analysis and the patterns, what they were seeing was that with the continued gentrification of certain parts of Chicago, people were being forced to move further and further into the suburbs, in some cases into these small towns in Indiana, and they were starting to bring some of the urban problems with them. A lot of these smaller towns were ill equipped in terms of things like needle exchanges and condoms, because things like prostitution started to become a little more prevalent. So all of a sudden you could see what was happening, and then you could go in and try to remediate it, through education and through other mechanisms, before it became too much of an issue and spread even further into other regions of the state. That was, for me, a really amazing story of how data is being used broadly for social impact. It was also a recognition of just how hard it was to get all of that data, which frankly was more about the diversity of the data than the scale. You weren't dealing with terabytes of data. You were dealing with lots of small data in lots of crazy formats with lots of quality problems: a lot of sparse data, a lot of data that was anomalous in different ways, that ultimately had to be brought together to tell this story, get to some conclusion, and then ultimately a diagnosis and a remedy for the issues they were seeing. To me, it's a great example of the impact that platforms like Trifacta can have on speeding up that time, because minutes and days matter when you're dealing with something like HIV and an epidemic in a specific part of the country.
[00:51:56] Unknown:
Yeah, it's definitely a pretty wild story. In terms of your experiences of being the CEO at Trifacta,
[00:52:04] Unknown:
and particularly somebody who wasn't part of the founding team, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of coming into the company and helping to grow it to where it is today?

I guess my entry was pretty straightforward in that I came in with a lot of domain expertise. I knew a number of the people on the team, and from day 1 I was able to have an immediate impact because I could bring my network to bear very quickly, both on acquiring customers and on building out the organization. So a lot of those things were actually pretty straightforward for me. It also helped that the co-founder, Joe Hellerstein, who was the CEO before me, is now our chief strategy officer.
He said from day 1, I'm a professor, and at some point this is going to become as much about commerce as it is about technology. That's the point at which I'm going to need to take a different role here and probably go back to teaching part time and working here part time. So he made it pretty clear to all the investors and the employees from the beginning that he was not going to be the long-term CEO of the company, and that there would be a managed transition when the timing was right. There were no big surprises when that happened. I think that was very mature of him, in the sense that he recognized what he wanted to do, what he felt he was good at, and where he thought he could add value. Sometimes you have founders who want to hold on at all costs, no matter how the world changes around them or what they're actually best at. There was none of that succession challenge between me and Joe, and we've been great collaborators ever since. So that part of it was really great. The fact that it was set up that way from the beginning was a good lesson to me in thinking through succession planning, how things get communicated, how you set expectations around that, and thinking through bench strength in all the different areas that you come to depend on. So that was certainly a good experience and a good set of lessons learned. Beyond that, in terms of my arrival, the thing I find is that you have to hold two things in your head simultaneously: you have to be all about alignment and focus, but at the same time you also have to be adaptable. And those things tend to push and pull on each other in ways that create a lot of tension.
I think back to when we started, when on-prem data lakes kind of roamed the earth. Then all of a sudden cloud came in, and cloud data warehouses, and it became as much about SQL and cloud as it was about things like Spark and on-prem data lakes, and things shifted pretty fast. So on the one hand, you have to be really focused on what you're doing, because I'm a big believer that indigestion kills more small companies than starvation, and you just have to keep everybody rowing in the same direction. We have been pretty maniacally focused on the same thing since the beginning: this vision around self-service, this vision around user experience with machine learning informing it, and staying focused on the data prep pieces and on the ETL and ELT pieces. In the past people have said, well, why don't you go do an analytic workbench? Why don't you go do BI? Why don't you go be a catalog? Why don't you go do all these other things? And we've generally said no, in part because of focus, and in part because we don't want to dynamite our ecosystem of partners that bring us into opportunities or that we work with on implementations.
But things do change pretty quickly in this world. So as much as you have to be crazy focused, you also have to make sure you don't miss these secular shifts that happen, and it feels like they're speeding up these days. All of a sudden, when it became about SaaS, we had to burn the boats and be about SaaS. And all of a sudden there was this SQL renaissance with the cloud data warehouses. Analytics was very stubbornly an on-prem thing up until, I'd say, 18 to 24 months ago; it's not even that long ago. But with the rise of things like BigQuery and Snowflake and others, now all of a sudden people seem way more comfortable doing their analytic work in the cloud. Whereas applications went cloud a decade ago, and DevOps went cloud a long time ago.
But analytics was still very stubbornly on-prem, and then boom, the bit flipped. When that happens, you don't want to be on the wrong side of that trend. So there is this constant tension between alignment and adaptability. You have to charge hard at the thing and get everybody 100% focused on the thing until it's not the thing, and then you have to shift, and you have to shift fast and aggressively, otherwise you miss these trends. For me, that has been the really exciting part of this job. It's also sometimes a little bit the nauseating part of this job. But in the end, I think that's what you sign up for when you're an entrepreneur, and that's what you sign up for when you work in technology. And increasingly, that's what you're signing up for when you work in data and analytics.
[00:57:23] Unknown:
And for people who are dealing with trying to get their data into a usable shape, or who are trying to democratize access to it, what are the cases where Trifacta is the wrong choice?
[00:57:34] Unknown:
What I would say is that there are certainly cases where people are focused on things like streaming, and that's not a focus for us today. At some point it would logically make sense that we'll do more work with streaming data, but we tend to operate on data that's landed somewhere. We tend to be batch or micro-batch. You can run it every few minutes if you want, but we're not transforming data in stream. There are also hardcore replication, change-data-capture-style use cases; that's not streaming data, but it's of a similar ilk in terms of more sophisticated replication, and that is not our focus.
There are definitely cases where people, for a variety of reasons, want a code-first or code-only solution for how they're going to do this work. And while we are embracing code more and more every day, in terms of Python and JavaScript and SQL, and providing opportunities to put code into Trifacta at different levels and in different places, there are cases where, if it really is "we just want to write a bunch of Python," it may not be the best fit for us. Certainly there are cases where people want to write a lot of code but still want our profiling, our orchestration, and the other things that we do. They don't want to have to write all the code; they just want to write some of it, and that's fine. But those are a couple of cases that come to mind. Then there's the other thing I mentioned: because we're so visual, people sometimes get confused and think, oh, maybe Trifacta could be my dashboard or my BI solution or my reporting solution. While we take a very visual approach, it's visualization in the service of cleansing and transformation of data.
It's not visualization for the purposes of end-user, downstream consumption. So I think we try to be pretty careful not to position ourselves where a Qlik, a Tableau, a Power BI, a Sisense, a ThoughtSpot, or somebody like that would play.
[00:59:40] Unknown:
And as you continue to lead the company and explore the constantly emerging trends in the data ecosystem, what are some of the things that you have planned for the near to medium term of Trifacta?
[00:59:51] Unknown:
A lot. I think we're spending a lot of time right now on what you mentioned earlier: dbt. We're doing a lot of work in and around dbt. We've got some of that in private preview right now, but you'll see some announcements from us around the work that we're doing. It's a great open source project that the community is really grabbing onto, and we think we've got a role to play in that story, so stay tuned for that. A year ago, we added a lot of capability in our optimizer to think about where to run certain things, doing certain types of push-down based on what was going on in the different sources and targets. More recently, with Google, we announced full compilation to SQL, where BigQuery is the engine and we actually run everything inside of BigQuery.
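To give a rough sense of what full compilation to SQL means in practice, here is a minimal sketch, assuming a toy recipe format and hypothetical table names; it is not Trifacta's actual compiler. The idea is that the visual prep steps get compiled into a single SQL statement that runs entirely inside BigQuery, so the data never leaves the warehouse.

```python
# Minimal sketch of "full push-down": compile a prep recipe into a single SQL
# statement and run it entirely inside BigQuery, rather than pulling data out.
# The recipe format, table names, and compile_to_sql helper are hypothetical.
from google.cloud import bigquery

recipe = [
    ("filter", "order_total IS NOT NULL"),
    ("derive", "order_total * 1.08 AS total_with_tax"),
]

def compile_to_sql(source: str, target: str, steps) -> str:
    """Turn a list of (step, expression) pairs into one CREATE TABLE AS statement."""
    filters = " AND ".join(expr for step, expr in steps if step == "filter") or "TRUE"
    derived = ", ".join(expr for step, expr in steps if step == "derive")
    select = f"*, {derived}" if derived else "*"
    return (
        f"CREATE OR REPLACE TABLE `{target}` AS "
        f"SELECT {select} FROM `{source}` WHERE {filters}"
    )

client = bigquery.Client()  # assumes default GCP credentials
sql = compile_to_sql("my_project.sales.orders", "my_project.sales.orders_prepared", recipe)
client.query(sql).result()  # the transformation executes inside BigQuery; no rows leave the warehouse
```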
You're going to see us do more of that with more of the cloud data warehouses, where it's not just partial push-down into different engines, but literally full compilation, really leveraging those engines as the engine. The cost and performance characteristics we're getting out of that, if that's where your data lives, are just massive. So that's been super exciting. We're also continuing to build out what we call the developer experience story, doing more and more to facilitate the things the data engineering crowd is pushing us on, with integrations to Airflow and Git and everything in and around that. We've done some of that today, and we're going to keep doing more, continuing to optimize how we tie into the rest of the tool chain that the data engineering community is focused on. Those are some of the chunkier areas. We're also doing more around monitoring drift, providing a lot of intelligence on how to handle drift through some of the pipelining work we're doing. Those are all pretty significant areas of investment for us.

Are there any other aspects of the Trifacta platform and the work that you're doing, or the overall place that it has in the landscape of data, that we didn't discuss yet that you'd like to cover before we close out the show?

I think we covered a lot of it. Maybe the other thing we didn't get into, which is a little less about the technology and more about the business model, is that we've also moved to a completely usage-based model. We were always a subscription service, but we've moved to usage-based pricing because it's important to us not just to democratize the actual production of data products, but to have a model for pricing and packaging that can be democratized as well. It doesn't do any good to say we're all about democratization and then charge you $10,000 a seat, or charge some huge platform fee to get started.
With Trifacta, you can get started for free, and the first tier is $80 a month. We want to make it really accessible to a broad range of users, a broad range of use cases, and a broad range of different types of companies. We think that's really important, and it's probably something that gets less focus, but ultimately the business model has to be coherent with your technology story and your go-to-market strategy. So we've put a lot of effort into refining that over the last couple of years, as we've started to see more and more users come to us that are not just the big banks and insurance companies and pharmaceutical companies, but more digital-native companies, more startups, and more mid-market companies that are often doing really exciting things with data.
But at the end of the day, they need something that's right-sized to what they can afford to roll out, and they often don't have big IT organizations that can take on massive projects. They really need something that is completely integrated from an experience perspective.
[01:03:50] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

This probably relates a little bit to our investment thesis, but we keep hearing from customers that
[01:04:13] Unknown:
the data quality piece in general still does not get enough attention. And I think that as you start talking about more automation, whether it's robotic process automation or just more algorithmic decision making happening in an automated way, people are terrified that we may start automating bad decisions faster based on bad data. I always tell the team, we can't invest enough in the data quality capabilities of the platform, because every time we come out with something, it is used off the charts immediately. And so you're hearing a lot more about a more modern spin on this around observability.
I think that certainly is one aspect of all of this. But this has been a constant challenge for a lot of organizations forever, and people have lamented the fact that their data quality is not great. I keep hearing from executives now, and even from the data engineers we work with, that they want to take this moment in time as the burning platform to go fix some of those data quality problems, and to get good at identifying them and remediating them, because the logical conclusion of all this is that you're going to have to find needles in the haystack when the haystacks are coming at you faster and faster.
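As a toy illustration of the kind of quality monitoring being described, and not anything specific to Trifacta's product, drift detection often starts with comparing simple profile statistics for a new batch against a baseline. The column names and the threshold below are made up for the example.

```python
# Toy sketch of profile-based drift detection: compare simple column
# statistics for a new batch against a baseline and flag big shifts.
# Column names and the 10-percentage-point threshold are illustrative.
def null_rate(rows, column):
    values = [row.get(column) for row in rows]
    return sum(v is None or v == "" for v in values) / max(len(values), 1)

def detect_drift(baseline_rows, new_rows, columns, threshold=0.10):
    alerts = []
    for col in columns:
        base, new = null_rate(baseline_rows, col), null_rate(new_rows, col)
        if abs(new - base) > threshold:
            alerts.append(f"{col}: null rate moved from {base:.0%} to {new:.0%}")
    return alerts

baseline = [{"email": "a@x.com"}, {"email": "b@x.com"}]
incoming = [{"email": None}, {"email": "c@x.com"}]
print(detect_drift(baseline, incoming, ["email"]))
# ['email: null rate moved from 0% to 50%']
```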
[01:05:48] Unknown:
And a lot of what's happening based on that is getting more and more automated. So it is both a bigger opportunity and a bigger threat if it's not done right. I guess that, for me, is the one that sticks out the most.

Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Trifacta. It's definitely a very interesting platform and an interesting approach to the overall problem of dealing with messy data. I appreciate all the time and effort that you and your team are putting into it, and I hope you enjoy the rest of your day.

Yeah. Tobias, thank you so much for having me. It's been a lot of fun.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Adam Wilson and Trifacta
The Genesis of Trifacta
Trifacta's Evolution and User Experience
Impact of Modern Data Tools on Trifacta
Trifacta's Role in the Data Landscape
Architectural Aspects of Trifacta
Challenges and Adaptations in Trifacta's Evolution
User Experience and Workflow with Trifacta
Balancing Accessibility and Organization in Data
Handling Scale and Complexity in Data
Innovative Uses of Trifacta
Lessons Learned as CEO of Trifacta
When Trifacta is Not the Right Choice
Future Plans for Trifacta
Business Model and Accessibility of Trifacta