Evolving An ETL Pipeline For Better Productivity

Hello, and welcome to the Data Engineering podcast, the show about modern data management.

When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode.

With 200 gigabit private networking, scalable shared block storage, speedy SSDs, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform.

And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and one opening in Mumbai at the end of the year.

And for your machine learning workloads, they just announced dedicated CPU instances where you get to take advantage of their blazing fast compute units.

Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.

And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers for engineers.

Clubhouse lets you craft a workflow that fits your style, including per team tasks, cross project epics, a large suite of prebuilt integrations, and a simple API for crafting your own.

With such an intuitive tool, it's easy to make sure that everyone in the business is on the same page. And Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial.

Support the show and get your data projects in order.

And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.

For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.

We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. And coming up this fall are the combined events of Graphorum and the Data Architecture Summit in Chicago.

The agendas have been announced, and super early bird registration is available until July 26th for up to $300 off.

Or you can get the early bird pricing until August 30th for $200 off your ticket.

Use the code BNLLC to get an additional 10% off any pass when you register. And go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register for this and other events.

And you can go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

Your host is Tobias Macey. And today, I'm interviewing Aaron Gibralter and Raghu Murthy about the experience of Greenhouse migrating their data pipelines to Data Coral. So Aaron, could you start by introducing yourself?

Sure. Thank you, Tobias. Thank you for having me. Again, my name is Aaron Gibralter. I'm one of the directors of engineering here at Greenhouse. I work with 2 different teams. I'm on the product engineering side, running a team building one of our products, and I also work with our data science and data engineering teams. Greenhouse is a talent acquisition suite that helps companies acquire and retain the best talent. And, as is the case for most companies these days that build software, data is incredibly important to us.

And, Raghu, could you introduce yourself as well?

Absolutely.

Yeah. My name is Raghu Murthy. I'm the founder and CEO of Data Coral, a company where we are automating the building of data pipelines, and we have built it in such a way that it's all serverless. We have been working with Greenhouse over the past couple of years, so I'm really excited about this conversation, where we're able to take folks through the journey that Greenhouse went through while we were growing and while they were getting started on using their data as well.

And for anybody who wants to dig deeper into Data Coral itself and your experience of building and growing that company, I'll refer them back to the other interview that we did with you, and I'll add a link to that in the show notes. And so going back to you again, Aaron, do you remember how you first got involved in the area of data management?

Yeah. I think it's a bit of an interesting story for me. As I mentioned before, I work with 2 different teams and 2 different disciplines.

Product engineering and data are a bit different, and so the data piece was a little bit more happenstance. I joined Greenhouse 4 years ago, and not too long into my tenure, my boss, Mike Buford, now our CTO and the VP of engineering at the time, asked me to step in and get involved with our data scientists. At the time we had one. His name is Andrew Zerm. He came from academia. He was an astrophysics professor, PhD, and he had gone through one of the data science accelerator boot camps to transition from academia to the business world. And so he joined Greenhouse and started building out some of the data pipelines that we used to wrangle the data, and also started to build out our reporting and analytics capabilities as well as some machine learning stuff. But he was alone working on the team, and Mike Buford asked me to start to work with him, just to think about different use cases. And so that's how I got involved. It was a need the company had. I had a little bit of spare bandwidth at the time, and so it was just luck in a sense. But it's definitely become a huge interest for me, and I've become quite passionate about both the data science and data engineering side.

And, Raghu, for anyone who hasn't listened to your prior interview, can you share again how you first got introduced to the area of data management?

Yeah. So I've been an engineer working on data infrastructure and distributed systems for a while. Starting back in the day, I had to meet all the data pipelines that typically need to get built out for these teams.

And so going back to you, Aaron, you mentioned a bit about what it is that Greenhouse does. So I'm curious if you can talk a bit more about some of the ways that you use data within the business and your overall data infrastructure as it was before you made the move to Data Coral.

Sure. It's interesting. I think we've come a long way. As I mentioned, Greenhouse is a talent acquisition suite. We have a number of different products that help companies with their hiring and onboarding process. It's SaaS software, so companies buy our software, and then it's provided through the web. And I don't wanna sound like it's unique to us. I think pretty much any software company now generates a ton of data. Every user interaction, and the state of the database at any given time, has a lot of meaning. And so for us, the use cases for data, which is kind of this big amorphous blob term, span the gamut from us understanding our customers and how much value they're getting out of our platform, to helping our customers understand their own data even better, and everything in between.

As for the state of our data pipelines before Data Coral, as I mentioned, we had one data scientist, who is a bit of a polymath. He's a generalist. He's always been interested in everything from the infrastructure side to the modeling side. And so he actually dug in and ended up building out our ETLs himself. What our infrastructure team did at the time, this is 3 or 4 years ago, is they set up an EC2 instance for him in our VPC and gave him his own RDS and said, do whatever you want, go to town. And so he stood up Airflow there and built out a series of ETLs to pull data from our production databases, just connecting to our followers in our infrastructure, and pulled data out and reshaped it into the data science warehouse, which was his RDS. And then he connected a BI tool. I think at the time we were using Periscope. And so he started to build dashboards on top of it so that we could understand our customers' behavior.

And we had a feature at the time. I think it was called the matrix. The idea was that it was a matrix of every feature in our product and every customer, with check boxes on which features which customers were using. And that gave both our product team and our customer success team a better understanding of the lay of the land. And so that was kind of the first stage in exposing this data. Going back to your question, the use cases now have expanded. The company is much bigger, and I think we've even gone into building predictive features for our customers. So we have a feature in our product called Greenhouse Predicts.

And what it does is it looks at the status of a particular job pipeline. How many candidates are in every stage? How many candidates are in application review, initial screen, phone screen, on-site, and so on? And we built a model to predict when we expect a hire to be made. And so that's a feature that we offer to our customers.
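
As a rough illustration of the input described here, the per-stage candidate counts for one job can be laid out as a fixed-order vector. This is only a sketch: the stage identifiers, the `stage_counts` helper, and the sample data are assumptions made for illustration, not Greenhouse's actual schema or model input.

```python
from collections import Counter

# Hypothetical stage identifiers, named after the stages mentioned in the conversation.
STAGES = ["application_review", "initial_screen", "phone_screen", "on_site"]


def stage_counts(candidate_stages):
    """Count how many candidates currently sit in each pipeline stage."""
    counts = Counter(candidate_stages)
    # Use a fixed stage order so every job yields a vector of the same shape.
    return [counts.get(stage, 0) for stage in STAGES]


# One entry per candidate on a single made-up job.
job_candidates = [
    "application_review", "application_review", "application_review",
    "initial_screen",
    "phone_screen", "phone_screen",
    "on_site",
]

features = stage_counts(job_candidates)
print(dict(zip(STAGES, features)))
```

A vector like this, one per job, is the kind of "particular shape" a time-to-hire model could be trained on; the model itself is out of scope for this sketch.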

And so that requires us to train the model. We pull data out of our data warehouse in a particular shape, and then we build the model, train it, and then have deployed an inference engine that our application can use. But as I mentioned before, most of it was ad hoc, point-to-point ETLs before Data Coral. I know that was a bit of a long-winded answer to your question.

No. That was great.

And just to get a bit of a flavor too, I'm curious about the overall team size and structure that you're working with for the data engineer versus data science breakdown and the overall number of customers for that infrastructure.

Sure. And by customers, do you mean?

Internal users.

So when we started, as I mentioned, we literally just had one data scientist, with some support from product engineering and infrastructure, doing everything. At the time when I joined the company, 4 years ago, I think our engineering team was probably around 20 to 30 people, and the whole company was around 70 people. And we had one data scientist doing everything. And now the company is over 300, and our engineering team is 70 to 100. It's always hard to keep track with how much we're growing. But we still have a pretty small data science team. And I'll go into this a bit more later on, but I think what Data Coral has provided us is the ability to punch above our weight class in terms of our data science team and what we're able to provide. So we have no data engineers at Greenhouse. We have some teams that help with data engineering, but we have no dedicated data engineers. And we have 2 data scientists, and we're growing right now. So we'll be bringing on another, let's say, 2 to 3 data scientists this year, with no expectation or need to hire any data engineers to support that, in large part because Data Coral gives us the tools to manage it ourselves without needing to handle the infrastructure.

The number of internal stakeholders we have has grown a lot. We predominantly started with customer success as our main stakeholder, but we have moved into working a lot with marketing, finance, sales, support, and engineering, R and D. So understanding our own site performance, our engineering throughput, our product OKRs. There's so much that we can do, and so much more that we can do, and that's why we're growing. As I said, I think we're able to punch above our weight class and do a lot more because we've adopted Data Coral, and it provides us a lot of leverage.

And can you talk through a bit more about the types of data sources that you're working with for pulling into your data warehouses and into your analytics?

Sure. At first, when we managed our own ETLs, it was mainly just our production database. We run Postgres as our main production database for our application. So it was mainly pulling data out of that about what our customers were doing in the product, what data artifacts are being left over from user interactions. And I think we may have built out some Salesforce ETLs to pull data from Salesforce, because Salesforce is a source of truth around customer information like key points of contact and segmentation like size, what industry the company is in, or what their address is for company location. So those things we were pulling from Salesforce.

And now we pull in from a lot of sources into our data warehouse with Data Coral. Of the main sources we have, again, the production database is probably the most important, but we also pull in Salesforce data, Zendesk data, Jira. Salesforce, Zendesk, and the production database are all more customer data, whereas Jira is our internal data: what cards were shipped, what features were shipping and how quickly. And now we're actually going deeper into that, pulling data from Datadog and Asana, so other project management software. And so that's been incredibly helpful too to have in our warehouse.

Yeah. Being able to correlate general usage patterns with when a specific feature might have been shipped, I can definitely see as being very valuable.

Exactly.

And so in terms of the data pipeline as it existed leading up to the point where you started looking around for alternatives, I'm curious what were the biggest pain points that you were dealing with and the ultimate motivation that led you to reevaluate your approach to the ETL processes that you were managing.

Yeah. I think it was a combination of things that led to the reevaluation. I think the biggest pain points were that, at the time, our team was quite small. It was just one person. So there was a lot of concern around the single point of failure, with all these ETLs being built out in a custom way by one person. We always talk about bus factor in engineering. I like to think about it more like the lotto factor, or something a little less dark. But if for some reason Andrew decided to leave, I think we would have been in a really bad state. He was kind of working in his own world. We even had a name for his part of our VPC. His last name, as I mentioned, is Zerm, so we called it the VPZ, virtual private Zerm. And whenever you have something like that, it's a bit of an organizational smell or robustness smell.

And so one of the biggest pain points was that idea of the single point of failure. The other thing is that maintenance of it was taking up a fair amount of his time. Ideally, a data scientist should be spending most of his or her time in some sort of leveraged capacity around data as opposed to working on the plumbing. I would have to ask him, but my impression was that he spent between 25% and, maybe at the worst times, 50% of his time wrangling data as opposed to spending it on analysis or predictive pieces. So that was definitely a big pain point: how much time we were investing in it.

And so once you came to the point where you decided that the current state of affairs wasn't really gonna be tenable in the long run, I'm curious what your criteria were for determining what a replacement would look like, whether that was a build versus buy decision or if you were looking at bringing on dedicated data engineers on staff and the types of tools that they might be interested in, and just the overall process of going through that evaluation and then the ultimate decision making process?

Yeah. I think the way you put it there is a generous way. I think in some ways, it was less intentional. In many ways, we got lucky. Things were working. In some sense, the main costs or pain points were opportunity costs. Because I was new to the field, I didn't really know what good looked like. So I think we were okay in some ways with the status quo. And luckily, we had an advisor through one of our investors that would speak with Andrew on a regular cadence, and he mentioned that we should consider looking at Data Coral and thinking about our data stack. And so it was through that that we started to explore it and realized what the possibilities would be.

But in some ways, I wish that we had been more intentional earlier on in saying, hey, what does this look like a year from now? I think we were just treading water and doing the best we could with the situation. And so I do think that in retrospect, if I were to do it again, I'd have a much better perspective on efficiency, where we're spending our time, and what leverage looks like. But ultimately, we got lucky in that Data Coral came along, and it was also very early in Data Coral's existence as well. So it was a fortuitous moment. It was just luck that we found each other at the right time, and we could grow together and get to a much better place.

And so

after you got introduced to Raghu and Data Coral and made the decision that you were going to replace your existing pipelines, I'm curious what your overall experience was of getting onboarded and making the cutover, determining the data quality, ensuring that it matched what you were already getting or that it was improved, and the overall experience of drawing down the previous infrastructure in favor of Data Coral.

Yeah. Definitely. Just going back a little bit, there's another big piece in terms of our evaluation criteria. The one thing we knew all along was that we never wanted our data to leave our VPC.

As is the case with a lot of companies, we care a lot about security, and the compliance piece for us is of extreme importance. We have a lot of very sensitive information at Greenhouse, and we would never use our customers' data in a way that we weren't explicit about. And so one piece there is that we knew that it might be something we would have to build internally, since a vendor where we would actually have to pipe data into someone else's warehouse or through someone else's pipes was not really an option. And I think Data Coral was built with the idea that the data stays in your own infrastructure from the start. And so when we found that out, one of those barriers to entry was immediately lowered, and it was something that we were interested in testing out. That was a really important piece in the evaluation process, and it allowed us to test out a use case without moving mountains. Because right now, every time we add a subprocessor, it has to go into all the contracts, or all the customers have to be made aware of where their data is going. And we do have a number of them. We use AWS. We're not building our own hardware and doing everything on prem. And so it's not crazy that we would use a vendor to achieve some value for our customers.

But in this particular use case, we knew that we wanted to keep the data flow in our own VPC. So when we were talking about the cutover process and using Data Coral, I think it was quite easy for us. We did a security and compliance review to make sure that all the CloudFormation templates that Data Coral was handing us, and all the stuff that they were providing us, was sound. But beyond that, I think it was easier for us to give it a try because we knew that it was running in our own infrastructure. And so that was the lens that we used: hey, we can stand this up, let's try it out and see how it works. If it seems like it's worthwhile, if it's easy and provides value to us, then we'll continue. And if not, we can easily spin it down and undo that. And so that was a pretty easy way for us internally to evaluate it.

And I think I talked to Raghu a bit about this, but because we started using Data Coral very early in Data Coral's existence, the cutover process was definitely not cut and dried. There was a lot of us working together to figure out the solutions to our problems, but I think we got to a really amazing place. And I think that process now for a company evaluating Data Coral will be quite different. Data Coral is quite mature now, so whatever I say here would be very different for someone trying it out now. But that cutover process probably took, I don't know, Raghu, do you have a sense of that? It was probably over the course of a year that we adopted Data Coral and were finally able to sunset our existing ETLs. Maybe even longer than that. But I think that if we did it today, it would be much faster.

Yeah. So,

maybe I can add a few things here. We started talking to Greenhouse very, very early on in our existence. And clearly, my goal at that point in time was to see whether the architecture that I had come up with, and that we were working on, would actually make sense. And the great thing about Greenhouse has been that, as you can imagine, for an early stage company, the most valuable thing is time and patience from potential customers. Aaron and his team were convinced about the architecture. And as you mentioned, the fact that we are running within their VPC lowered the barrier to entry for them to actually try it out. And then, just like what we do even now for our existing customers, it started off as one use case. I think the first use case was to pull data from their production environment and move it into a data warehouse, Redshift in this case, and essentially get Andrew Zerm out of having to query a follower of the production database, or having to do anything in an ad hoc manner. Instead, it is actually a better thing to have him just focus on doing the analysis itself, and he could do that directly on Redshift. So that was the first step for us, to just replicate one data source. And then slowly but surely, we started adding more and more connectors.

And, like Aaron mentioned before, they're pulling from Salesforce, Zendesk, Jira, and now I have lost track of the number of different sources from which they're collecting data. But in terms of the cutover period, the way that typically happens is that if companies already have something that is processing their data, they don't wanna just rip and replace everything, because that is much harder to do. Instead, the better thing to do is to take one use case and actually make it work. And our whole microservices-based architecture means that we can live alongside whatever companies might have. So more and more use cases came along where people were able to just use Data Coral directly instead of whatever they had originally. There were net new use cases that were being worked on directly using Data Coral. And, of course, the existing stuff can be moved, or you can keep it around or sunset it, however you wanna plan it out. The goal for Data Coral is to make sure that whatever use cases you wanna actually move, those are the ones that we can make super easy.

And, again, very early on in our engagement with Greenhouse, we had essentially an overall idea of the architecture and an initial, pre-alpha even, implementation. So we did go through a lot of learnings while working with Greenhouse. And even with the whole security architecture, we worked pretty closely with Greenhouse to get it to a point where not only was their security team happy with it, but we were able to leverage that work and get a lot out of it with our other customers, or even while working with AWS to get to the advanced technology partnership. There's a whole set of questions in the questionnaire, as you can imagine, when you're trying to get through these compliances or these partnership levels, around security and whether you're a data processor. All of those questions essentially were completely irrelevant to us because all of our software was running within the customer VPC.

Yeah. The security piece, I think, we worked through very closely together. As I said, security is of utmost concern to us. And so I think that was the hardest part of getting started with Data Coral, not in a bad way, but we just had to figure out an architecture that made sense such that Data Coral would not have access to any of the underlying data. And we were able to get there. And so it was really a great experience working with Data Coral to make sure that that was the case. And I'm happy that it contributed to the standard architecture for how Data Coral does these engagements.

And, Raghu, I'm also curious about some of the edge cases or sharp points in your infrastructure and architecture that ended up getting ironed out in the process of onboarding Aaron and Greenhouse and any of the other customers that you were working with in a similar time frame?

Yeah. Absolutely. One of the main things that you realize when you're building a system and then getting a bunch of customers, or at least initial users of the platform, is that there are, as you said, sharp edges. It's very easy to get into bad states. You don't pay as much attention to error checking or error propagation early on because you're mainly trying to establish the viability of the technology overall. So for the most part, at least initially, we would have to handhold customers through setting up Data Coral, like setting up these connectors. And then as the data was flowing, they would be able to fend for themselves. But if there were errors, instead of them knowing about it through a tool, they might run into problems because, hey, the data is not fresh, what happened? So we had a Slack channel where people could just ping us.

And along the way, we have clearly made our overall platform a lot more robust. We have gotten to a point where our customers typically don't have to worry about data quality. We catch errors sooner than anybody else can notice them, and we are able to fix them. And, again, all of this is happening because we're providing these automated data pipelines as a service. We use this notion of a cross-account role that allows us to monitor everything that's happening in the customer installation while still not having access to any of the data. The data itself is encrypted using keys that our roles don't have access to. So this whole combination of providing a SaaS offering, but within the customer VPC, has allowed us to, in some sense, give ourselves the time to build out the automation while still using operations to make sure that everything is actually working well.

And I don't know if you want me to go into these details, Raghu.

But I think there's some interesting stuff in getting into the nuts and bolts. One of the pain points we ran into: I think the original ETL system that Data Coral was working with, and frankly what we had going in our own Airflow ETLs, was built on this concept of pulling data out by the updated_at column. We have timestamp columns in our Postgres database that presumably say the last time the row was touched. Unfortunately for us, in our application, the timestamp columns are not trigger based in our database. So it's up to the application or the query writer to always update the timestamp at the update time. And we found that there were actually cases of bulk updates that we did that would affect a large number of rows but not touch the updated_at timestamp.

And so we actually had data quality issues, or data consistency issues, between our production database and our warehouse, where certain rows would be different in the production database than what was displaying in our data warehouse. And this led to a number of pain points in our analysis, especially around candidate pipelines. Application stages is a good example, the presence of a candidate at a given stage. Some of those would get updated in bulk and then not be updated in our data warehouse. So our warehouse would say something about the state of the pipeline, but it would not be accurate. And this is something that we ran into with Data Coral as well, because their original ETLs were based on polling and using the updated_at timestamp. And so we realized that with Data Coral and went through a number of different strategies to try and fix this.

One involved actually adding triggers to our production database to make the timestamps automatic. But the more we thought about this, the more our team was worried about implementing something so heavy in our database just for the purpose of our ETLs and our data warehouse. So we ultimately decided not to go the route of implementing these triggers or custom stored procedures in Postgres, and instead started to think about using logical decoding to stream changes from our database. That's ultimately the path that we went down, and Datacoral did too, and I think we've been extremely happy with the results. But that's an example of some of the work we did together to try and figure out the best way to get data out efficiently and consistently.

Yeah, absolutely. I think this is a great example where, for us, it was Greenhouse pushing us to get to the next level, and we were building it as we had these conversations with them. And, again, having a customer who's able to work through these kinds of situations or these kinds of problems is actually incredibly valuable for an early stage company.

Also, I think the timing was right. This was around the same time that RDS in Amazon was starting to support logical decoding. We were able to just leverage that and provide a serverless way of pulling all the changes from these databases and applying the changes in the warehouse.
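As a rough illustration of why a change stream avoids the timestamp problem, the sketch below applies a list of change records to an in-memory copy of a table. The record format (`op`, `id`, `row`) is invented for this example; it is not the output format of any Postgres logical decoding plugin, and it is not how Datacoral applies changes to a warehouse.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(
    replica: Dict[int, Row], changes: Iterable[Dict[str, Any]]
) -> Dict[int, Row]:
    """Apply insert/update/delete records, in order, to a keyed copy of a table."""
    for change in changes:
        op, key = change["op"], change["id"]
        if op in ("insert", "update"):
            # Merge the changed columns over whatever the replica already holds.
            replica[key] = {**replica.get(key, {}), **change["row"]}
        elif op == "delete":
            # Hard deletes show up in the stream, so they can be propagated.
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return replica


# Hypothetical warehouse copy and change stream.
warehouse = {1: {"stage": "phone_screen"}, 2: {"stage": "phone_screen"}}
change_stream = [
    {"op": "update", "id": 1, "row": {"stage": "rejected"}},  # part of a bulk update
    {"op": "update", "id": 2, "row": {"stage": "rejected"}},  # no timestamp needed
    {"op": "insert", "id": 3, "row": {"stage": "onsite"}},
    {"op": "delete", "id": 2},
]
apply_changes(warehouse, change_stream)
print(warehouse)
```

Because every committed change appears in the stream, the replica does not depend on writers remembering to touch a timestamp column.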

And so once you made the cutover and started using Datacoral more full time and were getting ready to sunset your existing infrastructure, what were your evaluation criteria for determining that the data quality was sufficient, that you were able to replicate all of the prior capabilities, and that everybody was able to do the work that they needed to do once you had made the cutover?

So, as to your question about the big wins, or knowing when things are ready: as I said a little bit before, it was a gradual process for us. Most of what we were building in this test phase with Datacoral were new analyses or new reports. So I think it was really once we were confident that the new reports had happy customers that we knew we could cut over everything.

As a specific example, one of the guinea pig projects we took on was a set of dashboards that I think was originally called the QBR deck; now it's the EBR. QBRs are quarterly business reviews; EBR means executive business review. I think it's a pretty common concept in SaaS enterprise companies, where an account manager or a customer success manager will sit down with a customer and talk about how things are going, how we can make sure that you are achieving your goals as a company, and how you can use our product to achieve those goals.

We had our original EBR deck powered by those original ETLs, and that's where we would often run into the data inconsistencies I was talking about before. So what we decided to do was build out the new EBR dashboard, which again would be the dashboard in our BI tool that a CSM would use to generate the charts for their specific customers, which they would then either print out as a PDF or copy and paste into a PowerPoint presentation to talk about with those customers. We decided to rebuild that using the Datacoral materialized views and the Datacoral data. That was one of the big first use cases, and it's something that I think went over pretty successfully, and we were able to get a team of dozens of customer success managers to use it.

Once that was up and running and they weren't using the old one anymore, that was the signal to us that we could start to use Datacoral for more of these kinds of workflows.

The next one, which was pretty big, was that we automated our financial reporting: taking data from our transactional financial system, with all the line items of what our customers are paying for, and reconciling that with Salesforce, all in SQL, through a series of materialized views that then feed a dashboard for our finance team to build what we call the ARR momentum report. That's our annual recurring revenue momentum report, which shows how much our customers are paying us and what is changing over time, broken down by segment and sliced and diced in different ways. And allowing them to download a copy of that data from our BI tool, which they can then pull into Excel and massage further to understand it in different ways, is a huge win, and our finance team is super happy with it.

There's another use case that also involves the customer success team.

As I mentioned before, one of our biggest stakeholders is our customer success team. The customer success team here at Greenhouse uses a tool called Totango to manage relationships with customers. It's a CRM-type tool specifically for customer success, and basically it helps automate keeping tabs on customers' health and automating communication with the customers, like sending out email communications and surveys and other things like that.

The way that Totango works is that, in order to understand customer health, you have to give it data about your customers. There are two main ways to do that. They have a JavaScript library you can throw on your page; you give it some JavaScript instructions, and it will instrument your application and track your customers. And Totango also has an API where you can send it events from a server, from a back end.

So when we were evaluating Totango, and this is me wearing my product engineering hat, as a product engineer I'm always very hesitant to throw third-party dependencies into core product workflows. As much as possible, I like to avoid putting additional JavaScript on a page that can either have errors or cause load time to increase; you're essentially running someone else's code on your page, and that always makes me nervous. So including the Totango JavaScript on our page was not something I was a fan of, and it was something that I discouraged our team from doing. I mentioned that during our discovery phase with Totango.

So the obvious next choice is to pipe data in through the API. Basically, Totango has an API where you can say, here is an event, this thing happened: a customer navigated to this page, or a customer made a hire in our software, and that would contribute to the health score. And here again, the naive implementation, I think, would be to actually litter the code base itself, our production code, with instrumentation calls to Totango. In the controller that handles a hire being made in Greenhouse, we could fire off an event to Totango to say, hey, a hire was just made. But that would mean our code base would start to get littered with this instrumentation that we would then have to maintain over time, and as behaviors change, we would have to change it. To me, that seemed like a very bad idea as well.

So I immediately suggested that, instead of this being a product engineering problem, I would shift it onto the data side, to the data science and data engineering side.

Datacoral has a Totango publishing slice that allows us to send data to that Totango API. What's nice for us is that we don't have to worry about what that API looks like or what shape of data they want. All that really matters is what shape Datacoral expects that data to be in so that they can send it to Totango. So what we're able to do is write a series of materialized views to transform the data into the right shape, and Datacoral handles the rest. It will periodically push those events to Totango, and everything is handled asynchronously and doesn't interrupt any of the product. The product engineers actually don't even worry about this. They don't think about it, and it's really nice for them not to have that on their mind at all. So this is a fantastic use case where we have data and we want to send it to some other system, and we can transform it and send it to that system without getting in the way of any other work.

And so now that you have been using Datacoral for a while and you're able to get all of your ETL processes done just using their capabilities, without having to have any dedicated data engineers, I'm wondering how you would characterize the overall experience of yourself and the people who are directly working with Datacoral, and the ways that you're using the time that you freed up from trying to maintain your prior ETL pipelines?

Yeah, I think that's a great question. For the most part, the experience of the people on my team working with Datacoral has been great. I think we've had a good working relationship, and there have been times where we've had to work through some stuff that was not ready. As Raghu mentioned, the experience that we've had is probably a unique one, having grown with Datacoral. But overall, it's really easy for a data scientist on our team to think about the data flows in the form of SQL transformations, essentially materialized views. SQL is kind of a lingua franca, a common language that's really easy for all of us to understand. And having data go from one table to another to another, as opposed to flowing through hard-to-define scripts or transformations that aren't so straightforward, makes the whole system really approachable for both our team and our collaborators.

So even with the slightly less technical folks embedded within the stakeholders, the people that work in CS operations or marketing operations or sales operations, we can talk to them about materialized views and collaborate on what those should look like and what shape the data should be in. It all makes sense to everyone, as opposed to being this black box of data flowing in all sorts of directions.

In terms of what we're doing with that spare time, I think we can basically pour that back into higher-leverage activities like analysis or even predictive analytics. On the data engineering side, the engineers within our organization who have helped with data engineering over the years, rather than being dedicated data engineers, are able to work on other parts of our internal tooling that are a bit higher leverage. For example, our CI/CD platform: how code gets from a developer's machine to production. We've built out an actual internal PaaS here at Greenhouse where we are able to deploy to ephemeral dev environments. Product engineers that join our team are kind of awestruck by some of the processes we have that allow this really easy testing and staging on the fly.

I think we invested a lot in that, and we've been able to invest a lot in that because we don't have to work on data engineering. I'm not going to chalk it all up to that; I think there are a lot of other pieces to the puzzle. But in short, as I said before, not having to worry about these ETLs allows us to focus on higher-leverage activities.

And just wondering if you can also quickly talk through what the current workflow looks like for building and maintaining the data flows that you're deploying onto the Datacoral platform, what the interaction pattern looks like, how you're managing and organizing the code that you're deploying for those data flows, and how you ensure discoverability or visibility of what the flows are.

That's a fantastic question. I think that's one of the next big things for our team as we scale, as we hire more data scientists. It's going to be of extreme importance for us to have that discoverability and that structure that makes sense, because if we don't, I think there's going to be a lot of rework or stepping on each other's toes. I think this is also an area that Datacoral is working on. I know that Datacoral is working on this a lot, and I hope they feel that we're contributing on this front in terms of feedback. But I think there's some work to do here in terms of how to standardize these workflows.

So to get into the nuts and bolts: Datacoral provides a CLI tool. You run datacoral and then a command, like datacoral organize, and then you can say matview (materialized view) create, and then you specify a path to a file, a .dpl file. That's a data programming language file, which is essentially a SQL command with some comments and annotations at the top that say what kind of materialized view it is, the frequency with which it should be refreshed, and so on. So that SQL file is the source; that is the transformation that's going to happen.

In a naive world, you'd basically have people just write some SQL, run it through the CLI, and create these matviews. Obviously, we want to be doing code review: if someone's going to create a new materialized view, we want someone else to approve it, and we want it all under version control. So we have a single Git repository called Datacoral that contains all of our materialized views in a structure that makes sense to us. We basically have the different schemas as top-level directories, where you can imagine a schema roughly correlating with a use case. So you could say analytics_cs, customer success analytics: all the materialized views that power the dashboards that the CS team uses are in that directory, in that schema.

But what we've had to do is write some makefiles or scripts to make the process a little more streamlined. And then I think there's the risk that we don't have any kind of CI/CD, continuous integration or continuous deployment, for these things, so we still have to run them manually. We've had to come up with our own process where we'll open a pull request and say, hey, I'm going to create this materialized view, can someone take a look at it? And once it's approved, then I use the CLI to deploy it. But it's not being enforced and it's not being automated. I'd love to get there; I think in some ways there's some work for us to do, and in some ways there's work that Datacoral is doing to make this a bit more streamlined as well.

Yeah. So, just to give yet another example of how Greenhouse is helping us move forward on this: one of the things that we have done right now is this whole compile step. Earlier, people would be able to just create these materialized views, and we would automatically infer the dependencies and generate the pipelines. But now, with these DPL files all in one repository, when you're trying to update one of those materialized views, you should be able to run a compile step that will then tell you if there's anything downstream that might get affected, because we know what the data dependencies are. And, again, with Greenhouse leading the way in terms of providing the right kind of use cases, we've been able to get started on that compile step. The idea would then be that we'd actually provide a CI/CD pipeline where you change that one materialized view somewhere in the middle of the DAG, and then you should be able to not only push it to production, but also run it in a test mode from that node all the way downstream, so that you know what the difference is going to be after the change has actually been applied.
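The downstream-impact check described here can be sketched as a graph traversal. This is only an illustration of the idea under stated assumptions: the dependency map is written by hand rather than inferred from SQL, the view names are hypothetical, and nothing here reflects how Datacoral's compile step is actually implemented.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def downstream_of(dependencies: Dict[str, Set[str]], changed: str) -> List[str]:
    """List every view that directly or transitively reads from `changed`.

    `dependencies` maps each view to the set of tables or views it selects from.
    Results come back in breadth-first order from the changed node.
    """
    # Invert "view -> upstreams" into "upstream -> views that read it".
    children: Dict[str, Set[str]] = defaultdict(set)
    for view, upstreams in dependencies.items():
        for upstream in upstreams:
            children[upstream].add(view)

    seen: Set[str] = set()
    order: List[str] = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in sorted(children[node]):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical materialized views grouped by schema, as in the repository layout above.
deps = {
    "analytics_cs.hires": {"raw.applications"},
    "analytics_cs.account_health": {"analytics_cs.hires", "raw.accounts"},
    "analytics_cs.ebr_dashboard": {"analytics_cs.account_health"},
    "analytics_finance.arr_momentum": {"raw.invoices"},
}
affected = downstream_of(deps, "analytics_cs.hires")
print(affected)
```

A test-mode run would then re-evaluate only the views in `affected`, leaving unrelated branches such as the finance view alone.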

So these are things that we are actively working on, again with Greenhouse helping lead the way in terms of providing the right kind of use cases.

And so as you continue to work with Datacoral and, Raghu, as you continue to work with Greenhouse, I'm wondering what you're hoping to see in the future in terms of the platform evolution, or any plans that you have going forward to add new capabilities or capacity to Datacoral.

Yeah. I think that piece we were just talking about is one part of it: the operationalization, productionizing, and those words. Basically, making this whole process scale and be discoverable. The data warehouse is becoming its own production system, and with any production system you want some sort of staged approach to change management. You don't want to just be doing it live. So what we just talked about, the different tools that can help stage a change to the data pipeline and show its effect, is a big piece of what I'm looking forward to in the future.

On the other end of the spectrum: while SQL is an incredible way to express these data transformations, I think there are some use cases where things are a little more complicated, or you might want to do something a bit more advanced. I think Raghu will probably speak more to this, but building more sophisticated data transformations using the same system will be incredibly valuable.

Yeah. So, to add to that, one of the things we are hearing from Greenhouse and other customers is that they'd like to move beyond SQL to be able to specify more complicated transformations. But we really like the whole set of abstractions that SQL provides, around explicit data dependency specification as well as the abstraction of just saying what you want to get done, not how.

So we have come up with this abstraction called the user-defined table generation function. Again, this is not new; query engines like Hive have had it for a long time. But we have come up with a way where people can plug in their Python code to do much more complicated transformations, even things like batch inference. And you should be able to plug that into a data flow. The data flow specification itself is done in SQL, because that's how we are able to infer the data dependencies that will then generate the data pipelines.
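For readers unfamiliar with the concept, a table-generating function takes one input row and emits zero or more output rows. The sketch below shows that shape in plain Python; the function names, the row format, and the `stage_history` column are all hypothetical, and this is not Datacoral's plug-in interface, which the conversation does not describe.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def explode_stage_history(row: Row) -> Iterator[Row]:
    """Table-generating function: one input row in, one output row per stage."""
    stages = str(row["stage_history"]).split(">")
    for position, stage in enumerate(stages):
        yield {
            "candidate_id": row["candidate_id"],
            "position": position,
            "stage": stage,
        }


def apply_table_function(
    rows: Iterable[Row], fn: Callable[[Row], Iterable[Row]]
) -> List[Row]:
    """Run the function over every input row and collect the generated rows."""
    out: List[Row] = []
    for row in rows:
        out.extend(fn(row))
    return out


# Hypothetical input table.
candidates = [
    {"candidate_id": 1, "stage_history": "applied>phone_screen>onsite"},
    {"candidate_id": 2, "stage_history": "applied"},
]
stage_rows = apply_table_function(candidates, explode_stage_history)
for r in stage_rows:
    print(r)
```

The same shape would accommodate batch inference: the function body would call a model and yield one row per prediction, while the surrounding data flow stays declared in SQL.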

So this is one of the features that we are super excited about, because it will hopefully allow data scientists to do a lot more than just write SQL.

And are there any other aspects of the work that you're doing at Greenhouse, or the work that you're doing at Datacoral, or the interaction between the two companies, that we didn't discuss yet that you'd like to cover before we close out the show?

Yeah. Actually, one of the things that I think maybe Aaron can talk about is the business requirement. And, Tobias, just to point out here, this was one of the big use cases that Aaron and I had talked about earlier, but we didn't get to it in terms of the big wins. This is around GDPR. Early last year, Greenhouse was trying to figure out how they were going to get GDPR compliant on their analytics warehouse.

One of the driving factors for the whole logical decoding approach to pulling data was to be able to deal with hard deletes of the data, and that is something we were able to clearly provide on the collect side. But once the data was in the analytics database, we wanted to get to a point where we're proactively anonymizing data, so that it was very easy for Greenhouse to comply with the right-to-be-forgotten rule. So when this requirement came along from Greenhouse, we worked with them pretty closely to get to a point where even data that was coming from the APIs of different tools like Salesforce, Zendesk, and Jira, we are able to anonymize, all using the same materialized view framework, to then allow them to be compliant with the right to be forgotten. Aaron, do you want to add to it?

Yeah. I think it was around May last year that this was happening, and we wanted to do everything that we could to be compliant. One of our big worries was: if we collect all this data in our data warehouse and don't have an easy way to propagate the deletes, that would be a big exposure point for us. So we wanted to make sure that we were able to handle that.

So, as Raghu mentioned, we worked with them. When we brought this up, this wasn't necessarily at the top of the roadmap for Datacoral. But as we spoke about it more, I think it was clear that it would be a big piece for any company that wanted to remain compliant. So we worked together to figure out how to move from the current implementation that we had to one that would be compliant. We were able to get there, and I think that was a big win for us. We were able to make our legal team happy in saying that we do comply and propagate those deletes.
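The two ideas discussed here, dropping rows for people who asked to be forgotten and masking personal fields for everyone else, can be sketched as a single transformation. Everything in this example is an assumption made for illustration: the field names, the salted SHA-256 masking, and the row format. Salted hashing is better described as pseudonymization, and whether it is sufficient for compliance is a legal question the conversation does not address.

```python
import hashlib
from typing import Dict, Iterable, List, Set

# Hypothetical list of columns treated as personal data.
PII_FIELDS = ("name", "email")


def pseudonymize(value: str, salt: str) -> str:
    """Replace a personal value with a salted SHA-256 digest (truncated for readability)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def sanitize(
    rows: Iterable[Dict[str, object]], forgotten_ids: Set[int], salt: str
) -> List[Dict[str, object]]:
    """Drop rows for forgotten people and mask personal fields in the rest."""
    out: List[Dict[str, object]] = []
    for row in rows:
        if row["candidate_id"] in forgotten_ids:
            continue  # propagate the delete instead of keeping a stale copy
        clean = dict(row)  # copy so the input rows are left untouched
        for field in PII_FIELDS:
            if field in clean:
                clean[field] = pseudonymize(str(clean[field]), salt)
        out.append(clean)
    return out


source_rows = [
    {"candidate_id": 1, "name": "Ada", "email": "ada@example.com", "stage": "onsite"},
    {"candidate_id": 2, "name": "Bob", "email": "bob@example.com", "stage": "offer"},
]
sanitized = sanitize(source_rows, forgotten_ids={2}, salt="demo-salt")
print(sanitized)
```

In the setup described above, the equivalent logic would live in materialized views rather than application code, so the same framework covers data from every source.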

And anything else that we should cover before we close out the show?

Yeah. I just wanted to thank Aaron, again, for the patience that they've had as we have grown. We believe it's the kind of customer that any startup could dream of. It's been a really great experience working with them, so we look forward to working with them closely going forward.

Yeah. The other thing: I mentioned that we worked with Periscope Data as our BI tool, but that was a number of years ago. The current BI tool that we work with is Mode, Mode Analytics, and I think we've been super happy with Mode as the primary window onto our data warehouse. And this is actually an interesting use case that we have. Metabase is an open source BI tool, and Datacoral provides a Metabase slice. Metabase is a pretty powerful BI tool where you can write SQL queries against your data warehouse, your Redshift instance, and see the results in the browser, which is obviously what a BI tool does. But what we discovered in rolling out Metabase is that it was a little rough around the edges for something that would span the entire company. So we decided to keep Metabase as an internal tool for our engineering team and our data scientists to prototype queries, but not to roll it out across the entire company.

The other part of that is that right now, the way Metabase is set up, it has relatively blanket access to our Redshift database. So anyone who has access to Metabase, which is a small subset of people, has access to a lot of data. That's a good and a bad thing. It's good because it allows us, again, to prototype some of these queries, but we obviously don't want everyone at the company to have access to every piece of data. So what we've been able to do, again using materialized views, is transform subsets of data into specific schemas in Redshift, and those are the schemas that we give Mode access to. Another piece of this is that we don't transfer PII or PI to the schemas to which Mode has access. So that's another layer of security and compliance: we are able to use materialized views to sanitize the data for more public consumption, and by public I mean within Greenhouse, of course.

That's another interesting use case that we've found to be a pretty big win for us: we can do that and sleep easier at night knowing that we don't have to have everyone have access to all the data, which I think is probably a worry for a lot of people.
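A small sketch of the pattern: raw data stays in a restricted area, and only a derived table without personal columns is exposed to the BI tool. SQLite stands in for Redshift here, so schemas and grants are only suggested by table-name prefixes; the table names, columns, and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Restricted source table, including a personal column (email).
conn.execute(
    "CREATE TABLE raw_candidates ("
    "id INTEGER PRIMARY KEY, email TEXT, customer TEXT, stage TEXT)"
)
conn.executemany(
    "INSERT INTO raw_candidates VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "acme", "onsite"),
        (2, "b@example.com", "acme", "onsite"),
        (3, "c@example.com", "globex", "offer"),
    ],
)

# Derived table playing the role of a materialized view in a shared schema:
# aggregated, with no personal columns carried over.
conn.execute(
    """
    CREATE TABLE shared_stage_counts AS
    SELECT customer, stage, COUNT(*) AS candidates
    FROM raw_candidates
    GROUP BY customer, stage
    """
)

# Column names of the derived table (second field of each table_info row).
shared_columns = [
    row[1] for row in conn.execute("PRAGMA table_info(shared_stage_counts)")
]
shared_rows = conn.execute(
    "SELECT customer, stage, candidates FROM shared_stage_counts "
    "ORDER BY customer, stage"
).fetchall()

print(shared_columns)
print(shared_rows)
```

In a real warehouse, the BI tool's database user would be granted access only to the shared schema, so the personal columns are unreachable from dashboards by construction.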

Yeah. Again, the fact that Metabase gets deployed inside of your VPC, which means it is only accessible through your VPN, adds to that security.

Exactly. But, again, not everyone has access to our VPN. The customer success team isn't logging on to our engineering VPN. So that was another reason why we didn't roll it out across the company: the technical hurdles were too great.

Alright. Well, again, thank you both for that. For anybody who wants to follow up with either of you, I'll have you add your preferred contact information to the show notes. And so as a final question, I'd just like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. Aaron, starting with you.

It's a good question.

Yeah. I think this is actually a piece that I would have touched on before, but we didn't get into it. The thing we're seeing more and more is our customers demanding real-time data, or close to real time. So rather than providing that data through data dumps once a day or every hour, the idea that you could transform and stream data in real time is definitely something that people want. But I think the tooling to get data out of a production system, maintain it, and send it in real time is still complex. I think we're getting close, and I'd like to see that. That's something, again, that I've talked with Raghu a lot about, and I think Datacoral is thinking about it as well. We're leveraging logical decoding to pull all the changes that are coming from our Postgres database into our data warehouse; are there efficient ways to hook into that to transform it in real time?

And, Raghu?

Yeah. From my perspective, the biggest gap is in the complexity of the whole tool chain across all the kinds of functionality needed for data management. It's still very hard. Even though there are quite a lot of options, it's actually pretty hard for any one company to say, okay, now I know exactly what the right end-to-end toolkit needs to look like. We are doing a little bit to help standardize tooling for end-to-end data flows. But as more and more companies have lots more data and lots more kinds of use cases, I think this is a problem that will only keep increasing, where there are more and more options, each one does a small sliver of what you need, and it's up to you to put them all together yourself. I think that's something that we are trying to put a dent in.

Alright. Well, thank you both again for the time today to talk through your experiences. It's definitely valuable to get some insight into the ways that different people are running their engineering and managing their data platforms. So I appreciate the both of you taking the time today, and I hope you each enjoy the rest of your day.

Thank you.

Thank you so much, Tobias.
