Summary
Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components to a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allow his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service, this is definitely worth listening to for some perspective.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Aaron Gibralter and Raghu Murthy about the experience of Greenhouse migrating their data pipeline to DataCoral
Interview
- Introduction
- How did you get involved in the area of data management?
- Aaron, can you start by describing what Greenhouse is and some of the ways that you use data?
- Can you describe your overall data infrastructure and the state of your data pipeline before migrating to DataCoral?
- What are your primary sources of data and what are the targets that you are loading them into?
- What were your biggest pain points and what motivated you to re-evaluate your approach to ETL?
- What were your criteria for your replacement technology and how did you gather and evaluate your options?
- Once you made the decision to use DataCoral can you talk through the transition and cut-over process?
- What were some of the unexpected edge cases or shortcomings that you experienced when moving to DataCoral?
- What were the big wins?
- What was your evaluation framework for determining whether your re-engineering was successful?
- Now that you are using DataCoral how would you characterize the experiences of yourself and your team?
- If you have freed up time for your engineers, how are you allocating that spare capacity?
- What do you hope to see from DataCoral in the future?
- What advice do you have for anyone else who is either evaluating a re-architecture of their existing data platform or planning out a greenfield project?
Contact Info
- Aaron
- agibralter on GitHub
- Raghu
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Greenhouse
- Datacoral
- Airflow
- Podcast.__init__ Interview
- Data Engineering Interview about running Airflow in production
- Periscope Data
- Mode Analytics
- Data Warehouse
- ETL
- Salesforce
- Zendesk
- Jira
- DataDog
- Asana
- GDPR
- Metabase
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, speedy SSDs, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and 1 opening in Mumbai at the end of the year. And for your machine learning workloads, they just announced dedicated CPU instances where you get to take advantage of their blazing fast compute units.
Go to data engineering podcast.com/linode, that's l I n o d e, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers for engineers. Clubhouse lets you craft a workflow that fits your style, including per team tasks, cross project epics, a large suite of prebuilt integrations, and a simple API for crafting your own. With such an intuitive tool, it's easy to make sure that everyone in the business is on the same page, And data engineering podcast listeners get 2 months free on any plan by going to data engineering podcast.com/clubhouse today and signing up for a free trial.
Support the show and get your data projects in order. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. And coming up this fall are the combined events of Graphorum and the Data Architecture Summit in Chicago. The agendas have been announced, and super early bird registration is available until July 26th for up to $300 off.
Or you can get the early bird pricing until August 30th for $200 off your ticket. Use the code b n l l c to get an additional 10% off any pass when you register. And go to data engineering podcast.com/conferences to learn more and to take advantage of our partner discounts when you register for this and other events. And you can go to data engineering podcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey. And today, I'm interviewing Aaron Gibralter and Raghu Murthy about the experience of Greenhouse migrating their data pipelines to DataCoral. So Aaron, could you start by introducing yourself? Sure. Thank you, Tobias.
[00:03:01] Unknown:
Thank you for having me. Yeah. Again, my name is Aaron Gibralter. I'm one of the directors of engineering here at Greenhouse. I work with two different teams: I'm on the product engineering side, running a team building one of our products, and I also work with our data science and data engineering teams. Greenhouse is a talent acquisition suite that helps companies acquire and retain the best talent. And, yeah, I guess as is the case for most companies these days that build software, data is incredibly
[00:03:34] Unknown:
important to us. And, Raghu, could you introduce yourself as well? Absolutely.
[00:03:38] Unknown:
Yeah. My name is Raghu Murthy. I'm the founder and CEO of DataCoral, a company where we are automating the building of data pipelines, and we have built it in such a way that it's all serverless. We have been working with Greenhouse over the past couple of years, so I'm really excited about this conversation where we're able to just
[00:03:57] Unknown:
take folks through the journey that Greenhouse went through while we were growing and while they were getting started on using their data as well. And also, for anybody who wants to dig deeper into DataCoral itself and your experience of building and growing that company, I'll refer them back to the other interview that we did with you, and I'll add a link to that in the show notes. And so going back to you again, Aaron, do you remember how you first got involved in the area of data management? Yeah. I think it's a bit of an interesting story for me. As I mentioned before, I work with two different teams and two different disciplines.
[00:04:32] Unknown:
Product engineering and data are a bit different, and so the data piece was a little bit more happenstance. I think when I joined Greenhouse four years ago, not too long into my tenure, my boss, now our CTO, who was the VP of engineering at the time, asked me to step in and get involved with our data scientists. At the time we had one. His name is Andrew Zerm. He came from academia. He was an astrophysics professor with a PhD, and he had gone through one of the data science accelerator boot camps to transition from academia to the business world. And so he joined Greenhouse and started building out some of the data pipelines that we used to wrangle the data, and also started to build out our reporting and analytics capabilities, as well as some machine learning stuff. But he was alone working on the team, and Mike Buford asked me to start to work with him, just to think about different use cases. And so that's how I got involved. It was kind of a need the company had. I had a little bit of spare bandwidth at the time, and so it was, you know, just luck in a sense. But it's definitely
[00:05:50] Unknown:
become a huge interest for me, and I've become quite passionate about both the data science and data engineering side. And, Raghu, for anyone who hasn't listened to your prior interview, can you share again how you first got introduced to the area of data management? Yeah. So I've been an engineer working on data infrastructure and distributed systems for a while, starting back in the day, and I had to build all the data pipelines that typically need to get built out for these teams.
[00:06:15] Unknown:
And so going back to you, Aaron, you mentioned a bit about what it is that Greenhouse does. So I'm curious if you can talk a bit more about some of the ways that you use data within the business and your overall data infrastructure
[00:06:28] Unknown:
as it was before you made the move to DataCoral. Sure. It's interesting. I think we've come a long way. As I mentioned, Greenhouse is a talent acquisition suite. We have a number of different products that help companies with their hiring and onboarding process. It's SaaS software, so companies buy our software and then it's provided through the web. And as a result, and I don't want to sound like it's unique to us, I think pretty much any software company now generates a ton of data: every user interaction, and the state of the database at any given time, has a lot of meaning. And so for us, the use cases for data, which is this big amorphous blob of a term, span the gamut from us understanding our customers and how much value they're getting out of our platform, to helping our customers understand their own data even better, and everything in between. The state of our data pipelines before DataCoral was, as I mentioned: we had one data scientist who is a bit of a polymath. He's a generalist. He's always been interested in everything from the infrastructure side to the modeling side. And so he actually dug in and ended up building out our ETLs himself.
I think what our infrastructure team did at the time, this is three or four years ago, is they set up an EC2 instance for him in our VPC, gave him his own RDS, and said, do whatever you want, go to town. And so he stood up Airflow there and built out a series of ETLs to start to pull data from our production databases, just connecting to our followers in our infrastructure, pulling data out and reshaping it into the data science warehouse, which was his RDS. And then he connected a BI tool. I think at the time we were using Periscope. And so he started to build dashboards on top of it so that we could understand our customers' behavior.
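For readers who haven't run this kind of setup, here is a minimal sketch of what such a hand-rolled Airflow job often looks like. It is illustrative only: the connection strings, table names, and query are hypothetical, not Greenhouse's actual code, and it uses the timestamp-based incremental pull that comes up again later in the conversation.

```python
# Illustrative sketch of a hand-rolled Airflow task that pulls recently
# updated rows from a Postgres read replica and upserts them into a
# data science warehouse. All names and queries are hypothetical.
from datetime import datetime, timedelta

import psycopg2
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def copy_candidates(**_context):
    src = psycopg2.connect("host=prod-follower dbname=app user=etl")       # read replica
    dst = psycopg2.connect("host=ds-warehouse dbname=analytics user=etl")  # warehouse RDS
    with src.cursor() as read_cur, dst.cursor() as write_cur:
        # Incremental pull keyed off updated_at; simple, but it silently
        # misses any bulk update that forgets to touch the timestamp.
        read_cur.execute("""
            SELECT id, organization_id, stage, updated_at
            FROM candidates
            WHERE updated_at >= now() - interval '1 day'
        """)
        for row in read_cur:
            write_cur.execute(
                """
                INSERT INTO dw_candidates (id, organization_id, stage, updated_at)
                VALUES (%s, %s, %s, %s)
                ON CONFLICT (id) DO UPDATE SET
                    organization_id = EXCLUDED.organization_id,
                    stage = EXCLUDED.stage,
                    updated_at = EXCLUDED.updated_at
                """,
                row,
            )
    dst.commit()
    src.close()
    dst.close()


dag = DAG(
    "candidates_etl",
    start_date=datetime(2019, 1, 1),
    schedule_interval=timedelta(hours=1),
    catchup=False,
)

PythonOperator(task_id="copy_candidates", python_callable=copy_candidates, dag=dag)
```

The appeal is that one person can stand this up quickly; the cost is that every new source, schema change, and failure mode becomes that one person's plumbing to maintain.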
And we had a feature at the time, I think it was called the matrix, and the idea was that it was a matrix of every feature in our product and every customer, with check boxes for which features each customer was using. And that gave both our product team and our customer success team a better understanding of the lay of the land. So that was the first stage in exposing this data. Going back to your question, the use cases have now expanded. The company is much bigger, and we've even gone into building predictive features for our customers. So we have a feature in our product called Greenhouse Predicts.
What it does is look at the status of a particular job pipeline: how many candidates are in every stage, how many candidates are in application review, initial screen, phone screen, on-site, and so on. And we built a model to predict when we expect a hire to be made. So that's a feature that we offer to our customers, and it requires us to train the model. We pull data out of our data warehouse in a particular shape, then we build the model, train it, and we've deployed an inference engine so that our application can use it. But as I mentioned before, most of it was ad hoc, point-to-point ETLs before DataCoral. I know that was a bit of a long-winded answer to your question. No. That was great.
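As a rough illustration of that kind of model, and not Greenhouse's actual implementation, predicting days-to-hire from stage counts could be framed as a small regression problem; the features and training data below are invented.

```python
# Hypothetical sketch: predict days-until-hire from pipeline stage counts.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each row: counts of candidates in [application review, initial screen,
# phone screen, on-site] for a job at training time.
X_train = np.array([
    [40, 12, 6, 2],
    [10,  4, 2, 1],
    [75, 20, 9, 4],
    [22,  7, 3, 1],
])
y_train = np.array([35.0, 55.0, 21.0, 42.0])  # observed days until the hire was made

model = GradientBoostingRegressor().fit(X_train, y_train)

# Score a live job pipeline pulled from the warehouse in the same shape.
live_pipeline = np.array([[25, 8, 3, 1]])
print(model.predict(live_pipeline))  # predicted days until a hire
```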
[00:10:26] Unknown:
And just to get a bit of a flavor, I'm curious about the overall team size and structure that you're working with, the data engineering versus data science breakdown, and the overall number of customers for that infrastructure.
[00:10:42] Unknown:
Sure. And by customers, do you mean? Internal users. So when we started, as I mentioned, we literally just had one data scientist, with some support from product engineering and infrastructure, doing everything. At the time when I joined the company four years ago, I think our engineering team was probably around 20 to 30 people, the whole company was around 70 people, and we had one data scientist doing everything. And now the company is over 300, and our engineering team is 70 to 100.
You know, it's always hard to keep track of how much we're growing. But we still have a pretty small data science team, and I'll go into this a bit more later on, but I think what DataCoral has provided us is the ability to punch above our weight class in terms of our data science team and what we're able to provide. So we have no data engineers at Greenhouse. Well, we have some teams that help with data engineering, but we have no dedicated data engineers. And we have two data scientists, and we're growing right now, so we'll be bringing on another, let's say, two to three data scientists this year.
And with no expectation or need to hire any data engineers to support that, largely because DataCoral gives us the tools to manage it ourselves without needing to handle the infrastructure. The number of internal stakeholders we have has grown a lot. We predominantly started with customer success as our main stakeholder, but we have moved into working a lot with marketing, finance, sales, support, and engineering and R&D, so understanding our own site performance, our engineering throughput, our product OKRs.
So there's so much that we can do, and so much more that we could do, and that's why we're growing. But as I said, I think we're able to punch above our weight class and do a lot more because we've adopted DataCoral, and it provides us a lot of leverage.
[00:12:58] Unknown:
And can you talk through a bit more about the types of data sources that you're working with for pulling into your data warehouses and into your analytics?
[00:13:08] Unknown:
Sure. At first, when we managed our own ETLs, it was mainly just our production database. We run Postgres as our main production database for our application, so it was mainly pulling data out of that about what our customers were doing in the product, like what data artifacts are being left over from user interactions. And I think we may have built out some Salesforce ETLs as well, because Salesforce is a source of truth around customer information like key points of contact and segmentation, like company size, what industry the company is in, or their address for company location.
So those things we were pulling from Salesforce. And now we pull from a lot of sources into our data warehouse with DataCoral. The main sources we have: again, the production database is probably the most important, but we also pull in Salesforce data, Zendesk data, and Jira. Salesforce, Zendesk, and the production database are all more customer data, whereas Jira is our internal data: what cards shipped, what features were shipping and how quickly. And now we're actually going deeper into that, pulling data from Datadog and Asana, so other project management software. And that's been incredibly helpful to have in our warehouse.
[00:14:43] Unknown:
Yeah. Being able to correlate sort of general usage patterns with when a specific feature might have been shipped, I can definitely see as being very valuable.
[00:14:52] Unknown:
Exactly.
[00:14:53] Unknown:
And so in terms of the data pipeline as it existed,
[00:14:59] Unknown:
sort of leading up to the point where you started looking around for alternatives, I'm curious what were the biggest pain points that you were dealing with and the ultimate motivation that led you to reevaluate your approach to the ETL processes that you were managing.
[00:15:13] Unknown:
Yeah. I think it was a combination of things that led to the reevaluation. The biggest pain points were that, at the time, our team was quite small, so it was just one person, and there was a lot of concern around the single point of failure. All these ETLs were being built out in kind of a custom way, with one person building them. We always talk about bus factor in engineering; I like to maybe think about it more like the lotto factor, or something a little less dark. But for some reason, if Andrew decided to leave, I think we would have been in a really bad state.
He was kind of working in his own world. We even had a name for his part of our VPC. His last name, as I mentioned, is Zerm, so we called it VPZ, virtual private Zerm. And whenever you have something like that, it's a bit of an organizational smell, or robustness smell. So one of the biggest pain points was that idea of the single point of failure. The other thing is that maintenance was taking up a fair amount of his time. Ideally, a data scientist should be spending most of his or her time in some sort of leveraged capacity around data, as opposed to working on the plumbing. And I think, probably, at the time, I would have to ask him, but my impression was that he spent between 25 percent and, at the worst times, maybe 50 percent of his time wrangling data, as opposed to spending it on analysis or predictive pieces.
[00:16:58] Unknown:
So that was definitely a big pain point: how much time we were investing in it. And so once you came to the point where you decided that the current state of affairs wasn't really going to be tenable in the long run, I'm curious what your criteria were for determining what a replacement would look like, whether that was a build versus buy decision or whether you were looking at bringing on dedicated data engineers and the types of tools that they might be interested in, and just the overall process of going through that evaluation and then the ultimate decision making process?
[00:17:33] Unknown:
Yeah. I think the way you put it there is generous. In some ways, it was less intentional; in many ways, we got lucky. Things were working, and in some sense the main costs or pain points were opportunity costs. Because I was new to the field, I didn't really know what good looked like, so I think we were okay in some ways with the status quo. And luckily, we had an advisor through one of our investors who would speak with Andrew on a regular cadence, and he mentioned that we should consider looking at DataCoral and thinking about our data stack. And so it was through that that we started to explore it and realized what the possibilities would be.
But in some ways, I wish that we had been more intentional earlier on and said, hey, what does this look like a year from now? I think we were just treading water and doing the best we could with the situation. In retrospect, if I were to do it again, I'd have a much better perspective on efficiency, where we're spending our time, and what leverage looks like. But I think ultimately we got lucky in that DataCoral came along, and it was also very early in DataCoral's existence. So it was kind of a fortuitous moment: it was just luck that we found each other at the right time, and we could grow together and get to a much better place. And so
[00:19:15] Unknown:
after you got introduced to Raghu and DataCoral and made the decision that you were going to replace your existing pipeline, I'm wondering what your overall experience was of getting onboarded and making the cutover, determining the data quality, ensuring that it matched what you were already getting or that it was improved, and the overall experience of drawing down the previous infrastructure in favor of DataCoral. Yeah. Definitely. I think another,
[00:19:47] Unknown:
just going back a little bit, another big piece in terms of our evaluation criteria was that the one thing we knew all along was that we never wanted our data to leave our VPC. As is the case with a lot of companies (or maybe not), we care a lot about security, and the compliance piece for us is of extreme importance. We have a lot of very sensitive information at Greenhouse, and we would never use our customers' data in a way that we weren't explicit about. So one piece there is that we knew it was either going to be something we would have to build internally, because the idea of a vendor where we would actually have to pipe data into someone else's warehouse or through someone else's pipes was not really an option. And DataCoral was built from the start with the idea that the data stays in your own infrastructure. So when we found that out, one of those barriers to entry was immediately lowered, and it was something that we were interested in testing out. That was a really important piece in the evaluation process, and it allowed us to test out a use case without moving mountains. Because right now, every time we add a sub-processor, it has to go into all the contracts, or all the customers have to be made aware that their data is going to be shared. And we use AWS, so we're not building our own hardware and doing everything on prem, and it's not crazy that we would use a vendor to achieve some value for our customers.
But in this particular use case, we knew that we wanted to keep the data flow in our own VPC. So when we were talking about the cutover process and using DataCoral, I think it was quite easy for us. We did a security and compliance review to make sure that all the CloudFormation templates DataCoral was handing us, and all the stuff they were providing, was sound. But beyond that, it was easier for us to give it a try because we knew that it was running in our own infrastructure. And so that was the lens that we used: hey, we can stand this up, let's try it out and see how it works. If it seems worthwhile, or if it's easy and provides value to us, then we'll continue; and if not, we can easily spin it down and undo that. So that was a pretty easy way for us internally to evaluate it. And, as I talked to Raghu a bit about, because we started using DataCoral very early in DataCoral's existence, the cutover process was definitely not cut and dried. There was a lot of us working together to figure out the solutions to our problems, but I think we got to a really amazing place, and that process now, for a company evaluating DataCoral, would be quite different. DataCoral is quite mature now, so whatever I say here would be very different for someone trying it out today. But that cutover process probably took, I don't know, Raghu, do you have a sense of that? It was probably over the course of a year that we adopted DataCoral and were finally able to sunset our existing ETLs, maybe even longer than that. But I think that if we did it today, it would be much faster. Yeah. So,
[00:23:44] Unknown:
maybe I can add a few things here. We started talking to Greenhouse very, very early on in our existence, and clearly my goal at that point in time was to see whether the architecture that I had come up with and that we were working on would actually make sense. And the great thing about Greenhouse has been, as you can imagine, that for an early stage company, the most valuable thing is time and patience from potential customers. Aaron and his team were convinced about the architecture, and as you mentioned, the fact that we run within their VPC lowered the barrier to entry for them to actually try it out. And then, just like what we do even now for our new customers, we started off with one use case. I think the first use case was to pull data from their production environment and move it into a data warehouse, Redshift in this case, and essentially get Andrew Zerm out of having to query a follower of the production database or do anything in an ad hoc manner. Instead, it is a better thing to have him just focus on doing the analysis itself, and he could do that directly on Redshift.
So that was the first step for us, to just replicate one data source. And then slowly but surely we started adding more and more connectors, and like Aaron mentioned before, they're now pulling from Salesforce, Zendesk, Jira, and more; at this point I have lost track of the number of different sources from which they're collecting data. But in terms of the cutover period, the way that typically happens is that if companies already have something that is processing their data, they don't want to just rip and replace everything, because that is actually much harder to do. Instead, the better thing is to take one use case and actually make it work. And our whole microservices-based architecture means that we can live alongside whatever companies might have. So more and more use cases came along where people were able to just use DataCoral directly instead of whatever they had originally, so there were net new use cases being worked on directly using DataCoral. And of course the existing stuff can be moved, or you can keep it around, or sunset it, however you want to plan it out. The goal for DataCoral is to make sure that whatever use cases you want to move are the ones that we can make super easy. And again, very early on in our engagement with Greenhouse, we had essentially an overall idea of the architecture and an initial, even pre-alpha, implementation.
We went through a lot of learnings while working with Greenhouse, even around the whole security architecture. We worked pretty closely with Greenhouse to get it to a point where not only was their security team happy with it, but we were able to leverage that work and get a lot out of it with our other customers, and even while working with AWS to get to the advanced technology partnership. There's a whole set of questions in the questionnaire, as you can imagine, when you're trying to get through these compliances or these partnership levels, around security and whether you're a data processor. All of those questions were essentially completely irrelevant to us because all of our software was running within the customer VPC. Yeah. The security piece, I think, we worked through
[00:27:12] Unknown:
very closely together. As I said, security is of utmost concern to us, and I think that was the hardest part of getting started with DataCoral, not in a bad way, but we just had to figure out an architecture that made sense such that DataCoral would not have access to any of the underlying data. And we were able to get there, so it was really a great experience working with DataCoral to make sure that that was the case. And I'm glad, and happy, that it contributed to the standard architecture for how DataCoral does these engagements.
[00:27:48] Unknown:
And, Raghu, I'm also curious about some of the sort of edge cases or sort of sharp points in your infrastructure and architecture that ended up getting ironed out in the process of onboarding Aaron and Greenhouse and any of the other,
[00:28:05] Unknown:
customers that you were working with in a similar time frame? Yeah. Absolutely. One of the main things that you realize when you're building a system and then getting a bunch of customers, at least initial users of the platform, is that, as you said, there are sharp edges. It's very easy to get into bad states. The amount of error checking or error propagation, you don't pay as much attention to it early on because you're mainly trying to establish the viability of the technology overall. So for the most part, at least initially, we would have to handhold customers through setting up DataCoral and setting up these connectors. And then as the data was flowing, they would be able to fend for themselves. But if there were errors, instead of them knowing about it through a tool, they might run into problems because, hey, the data is not fresh, what happened? So we had a Slack channel where people could just ping us. And along the way, we have clearly made our overall platform a lot more robust. We have gotten to a point where our customers typically don't have to actually worry about data quality; we catch errors sooner than anybody else can notice them, and we are able to fix them. And again, all of this is happening because we're providing these automated data pipelines as a service. We use this notion of a cross-account role that allows us to monitor everything that's happening in the customer installation while still not having access to any of the data. The data itself is encrypted using keys that our roles don't have access to. So this whole combination of providing a SaaS offering, but within the customer VPC, has allowed us to, in some sense, give ourselves the time to build out the automation while still using operations to make sure that everything is actually working well. And, I don't know if you want me to go into these details, Raghu.
[00:30:02] Unknown:
But I think there's some interesting stuff getting into the nuts and bolts. One of the pain points we ran into: the original ETL system that DataCoral was working with, and frankly similar to what we had going in our own Airflow ETLs, was this concept of pulling data out by the updated_at column. We have timestamp columns in our Postgres database that presumably say the last time the row was touched. Unfortunately for us, in our application the timestamp columns are not trigger based in the database, so it's up to the application or the query writer to always update the timestamp at update time. And we found that there were actually cases of bulk updates that we did that would affect a large number of rows but not touch the updated_at timestamp.
And so we actually had data quality issues, or data consistency issues, between our production database and our warehouse, where certain rows would be different in the production database than what we were displaying in our data warehouse. This led to a number of pain points in our analysis, especially around candidate pipelines. Application stages are a good example: the presence of a candidate at a given stage. Some of those would get updated in bulk and then not be updated in our data warehouse, so our warehouse would say something about the state of the pipeline that was not accurate. And this is something we ran into with DataCoral as well, because their original ETLs were based on polling using the updated_at timestamp. So we realized that with DataCoral and went through a number of different strategies to try and fix it.
One involved actually adding triggers to our production database to make the timestamps automatic. But the more we thought about it, the more our team was worried about implementing something so heavy in our database just for the purpose of our ETLs and our data warehouse. So we ultimately decided not to go the route of implementing these triggers or custom stored procedures in Postgres, and instead started to think about using logical decoding to stream changes from our database. That's ultimately the path that we went down, and DataCoral did too, and I think we've been extremely happy with the results. But that's an example of some of the work we did together to try and figure out the best way to get data out efficiently
[00:32:46] Unknown:
and consistently. Yeah. Absolutely. I think this is a great example where, for us, it was Greenhouse pushing us to get to the next level, and we were building it as we had these conversations with them. And again, having a customer who is able to work through these kinds of problems is incredibly valuable for an early stage company. Also, the timing was right: this is around the same time that RDS on Amazon was starting to support logical decoding, so we were able to just leverage that and provide a serverless way of pulling all the changes from these databases and applying those changes in the warehouse.
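To make the two approaches concrete, here is a minimal sketch of each: the trigger-based fix that Greenhouse decided against, and reading row changes from a Postgres logical replication slot, which is roughly the primitive that change-data-capture collection builds on. The table, slot, and connection names are hypothetical, and the example uses the built-in test_decoding output plugin rather than anything DataCoral-specific.

```python
# Sketch of the two approaches discussed above; all names are hypothetical.
import psycopg2

conn = psycopg2.connect("host=prod-primary dbname=app user=etl")
conn.autocommit = True
cur = conn.cursor()

# Option 1 (considered, then rejected): a trigger that keeps updated_at honest
# even for bulk UPDATEs that bypass the application's callbacks.
cur.execute("""
    CREATE OR REPLACE FUNCTION touch_updated_at() RETURNS trigger AS $$
    BEGIN
        NEW.updated_at := now();
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER candidates_touch_updated_at
        BEFORE UPDATE ON candidates
        FOR EACH ROW EXECUTE PROCEDURE touch_updated_at();
""")

# Option 2 (what they moved to): logical decoding. Requires wal_level=logical
# (rds.logical_replication=1 in the RDS parameter group). Every INSERT, UPDATE,
# and DELETE is emitted from the slot, so bulk updates and hard deletes can no
# longer be silently missed by a timestamp-based poll.
cur.execute(
    "SELECT pg_create_logical_replication_slot('etl_slot', 'test_decoding');"
)
cur.execute(
    "SELECT lsn, xid, data FROM pg_logical_slot_get_changes('etl_slot', NULL, NULL);"
)
for lsn, xid, data in cur.fetchall():
    # e.g. "table public.candidates: UPDATE: id[integer]:42 stage[text]:'onsite' ..."
    print(lsn, xid, data)
```

A production consumer would stream from the slot continuously and apply the decoded changes to the warehouse rather than polling like this, but the trade-off is the one described above: a little more machinery in exchange for never missing a change.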
[00:33:28] Unknown:
And so once you made the cutover and started using DataCoral more full time and were getting ready to sunset your existing infrastructure, what were your evaluation criteria for determining that the data quality was sufficient, that you were able to replicate all of the prior capabilities, and that everybody was able to do the work that they needed to do once you had made the cutover?
[00:33:57] Unknown:
So yeah, as to your question about the big wins, or knowing when things are ready: as I said a little bit before, it was a bit of a gradual process for us. Most of what we were building in this test phase with DataCoral were new analyses or new reports. So I think it was really once we were confident that the new reports had happy customers that we knew we could cut over everything. As a specific example, one of the main guinea pig projects that we took on was a set of dashboards that, I think, was originally called the QBR deck; now it's EBR. QBRs are quarterly business reviews; EBR means executive business review. It's a pretty common concept in SaaS enterprise companies, where an account manager or a customer success manager will sit down with a customer and talk about, hey, how are things going, how can we make sure that you are achieving your goals as a company, and how can you use our product to achieve those goals. So it's a pretty common concept in the world of SaaS software.
We had our original EBR deck powered by those original ETLs, and that's where we would often run into the data inconsistencies I was talking about before. So what we decided to do was build out the new EBR dashboard, which again would be the dashboard in our BI tool that a CSM would use to generate the charts for their specific customers, which they would then either print out as a PDF or copy and paste into a PowerPoint presentation to talk through with those customers. We decided to rebuild that using the DataCoral materialized views and the DataCoral data. And so that was one of the big first use cases, and it was something that I think went over pretty successfully, and we were able to get a team of dozens of customer success managers to use it.
And once that was up and running and they weren't using the old one anymore, that was the signal to us that we could start to use DataCoral for more of these kinds of workflows. The next one, which was pretty big, was automating our financial reporting: taking data from our transactional financial system, with all the line items of what our customers are paying for, and reconciling that with Salesforce, all in SQL through a series of materialized views that then create a dashboard for our finance team to build what we call the ARR momentum report, our annual recurring revenue momentum report. It shows how much our customers are paying us and what is changing over time, breaking that down by segment and slicing and dicing it different ways. And then allowing them to download a copy of that data from our BI tool, pull it into Excel, and massage it more to understand it in different ways is a huge win, and our finance team is super happy with it. There's another use case that also involves the customer success team.
As I mentioned before, one of our biggest stakeholders is our customer success team. The customer success team here at Greenhouse uses a tool called Totango to manage relationships with customers. It's a CRM-type tool specifically for customer success; basically, it helps automate keeping tabs on customers' health and automating communication with the customers, like sending out email communications and surveys and other things like that. And the way that Totango works is that in order to understand customer health, you have to give it data about your customers. There are two main ways to do that. They have a JavaScript library you can put on your page; you give it some JavaScript instructions, and it will instrument your application and track your customers. And Totango also has an API where you can send it events from a server, from a back end. And so when we were evaluating Totango, and this is me wearing my product engineering hat, as a product engineer I'm always very hesitant to throw third-party dependencies into core product workflows.
As much as possible, I like to avoid putting additional JavaScript on a page that can either have errors or cause load time to increase, because you're essentially running someone else's code on your page, and that always makes me nervous. So I knew that including the Totango JavaScript on our page was not something I was a fan of, and something that I discouraged our team from doing, and I mentioned that during our discovery phase with Totango. So the obvious next choice is to pipe data in through the API. Basically, Totango has an API and you can say, here is an event, this thing happened: a customer navigated to this page, or a customer made a hire in our software, and that would contribute to the health score. And here again, the naive implementation would be to litter the code base itself, our production code, with instrumentation calls to Totango. In the controller that handles a hire being made in Greenhouse, we could fire off an event to Totango to say, hey, a hire was just made. But that would mean that our code base would start to get littered with this instrumentation that we would then have to maintain over time, and as behavior changed, we would have to change it. And to me, that seemed like a very bad idea as well.
So I immediately suggested that, instead of this being a product engineering problem, we shift it onto the data side, the data science and data engineering side. DataCoral has a Totango publishing slice that allows us to send data to that Totango API. What's nice for us is we don't have to worry about what that API looks like; what shape DataCoral expects that data to be in, such that it can send it to Totango, is really all that matters. So what we're able to do is write a series of materialized views to transform the data into the right shape, and DataCoral handles the rest. It will periodically push those events to Totango, and everything is handled asynchronously and doesn't interrupt any of the product. The product engineers actually don't even worry about this; they don't think about it, and it's really nice for them not to have that on their mind at all. So this is a fantastic use case where we have data and we want to send it to some other system, and we can transform it and send it to that system without getting in the way of any other work.
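To make the pattern concrete, the transformation half of this is just SQL that reshapes warehouse rows into the event shape a publisher expects. The sketch below creates a plain view against a Redshift-style warehouse; the table names, columns, and event schema are invented, and this is ordinary SQL rather than DataCoral's actual DPL annotations or Totango's actual event format.

```python
# Hypothetical sketch: reshape warehouse data into "account event" rows that a
# downstream publisher can push to a customer-success tool's API.
import psycopg2

VIEW_SQL = """
CREATE OR REPLACE VIEW analytics_cs.cs_tool_hire_events AS
SELECT
    o.external_account_id AS account_id,
    'hire_made'           AS event_type,
    h.created_at          AS event_timestamp,
    h.job_id              AS job_id
FROM dw.hires h
JOIN dw.organizations o ON o.id = h.organization_id
WHERE h.created_at >= dateadd(day, -1, getdate());
"""

with psycopg2.connect("host=warehouse dbname=analytics user=ds") as conn:
    with conn.cursor() as cur:
        cur.execute(VIEW_SQL)
```

The point of the design is that product code never knows this exists: the event stream is derived from data already landing in the warehouse, so the instrumentation lives in SQL that the data team owns and can change without touching the application.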
[00:41:38] Unknown:
And so now that you have been using DataCoral for a while and you're able to get all of your ETL processes done just using their capabilities, without having to have any dedicated data engineers, I'm wondering how you would characterize the overall experience of yourself and the people who are directly working with DataCoral, and the ways that you're using the time that you freed up from maintaining your prior ETL pipelines?
[00:42:07] Unknown:
Yeah. I think that's a great question. For the most part, the experience of the people on my team working with DataCoral has been great. We've had a good working relationship, and there have been times where we've had to work through some stuff that was not ready; as Raghu mentioned, the experience that we've had is probably a unique one, having grown with DataCoral. But overall, it's really easy for a data scientist on our team to think about the data flows in the form of SQL transformations, essentially materialized views. It's a lingua franca, a common language: SQL is really easy for us all to understand. And having data go from one table to another to another, as opposed to flowing through hard-to-define scripts or transformations that aren't so straightforward, makes the whole system really approachable for both our team and our collaborators.
Even the slightly less technical folks embedded within the stakeholder teams, like the people that work in CS operations or marketing operations or sales operations, we can talk to them about materialized views and collaborate on what those should look like and what shape the data should be in. And it all makes sense to everyone, as opposed to being this black box of data flowing in all sorts of directions. And in terms of what we're doing with that spare time, I think we can basically pour that back into the higher-leverage activities like analysis or even predictive analytics.
On the data engineering side, the engineers within our organization that have helped with data engineering over the years have been able, rather than being dedicated data engineers, to work on other parts of our internal tooling that are a bit higher leverage. For example, our CI/CD platform: how code gets from a developer's machine to production. We've built out an actual internal PaaS here at Greenhouse where we're able to deploy to ephemeral dev environments.
And product engineers that join our team are kind of awestruck by some of the processes we have that allow this really easy testing and staging on the fly. We invested a lot in that, and we've been able to invest a lot in that because we don't have to work on data engineering. I'm not going to chalk it all up to that; there are a lot of other pieces to the puzzle. But in short, as I said before, not having to worry about these ETLs allows us to focus on higher-leverage activities.
[00:45:12] Unknown:
And just wondering if you can also quickly talk through what the current workflow looks like for building and maintaining the data flows that you're deploying onto the DataCoral platform, what the interaction pattern looks like, how you're managing and organizing the code that you're deploying for those data flows, and how you ensure sort of discoverability or visibility of
[00:45:39] Unknown:
what the flows are. That's a fantastic question. That's one of the next big things for our team as we scale and hire more data scientists. It's going to be of extreme importance for us to have that discoverability and a structure that makes sense, because if we don't, I think there's going to be a lot of rework, or stepping on each other's toes. I think this is also an area that DataCoral is working on; I know they're working on this a lot, and I hope that they feel we're contributing again on this front in terms of feedback. But I think there's some work to do here in terms of how to standardize these workflows.
To get into the nuts and bolts: basically, DataCoral provides a CLI tool. You run datacoral and then a command like organize, then matview create (materialized view create), and then you specify a path to a file, a .dpl file, a data programming language file that's essentially a SQL command with some comments and annotations at the top saying what kind of materialized view it is, the frequency with which it should be refreshed, and so on. That SQL file is the source of the transformation that's going to happen. In a naive world, you'd basically have people just write some SQL and run it through the CLI to create these matviews. Obviously, we want to be doing code review: if someone's going to create a new materialized view, we want someone else to approve it, and we want it all under version control. So we have a single Git repository called Datacoral that contains all of our materialized views in a structure that makes sense to us, with the different schemas as top-level directories.
You can imagine a schema roughly correlating with a use case. So you could have, say, analytics_cs, for customer success analytics: all the materialized views that power the dashboards the CS team uses are in that directory, in that schema. But what we've had to do is write some makefiles or scripts to make the process a little more streamlined. And then there's the risk that we don't have any kind of continuous integration or continuous deployment of these things, so we still have to run them manually. Even when we open a pull request, we've had to come up with our own process: we'll open a pull request and say, hey, I'm going to create this materialized view, can someone take a look at it? And once it's approved,
[00:48:26] Unknown:
then I use the CLI to deploy it. But it's not being enforced and it's not being automated. I'd love to get there, but I think in some ways there's some work for us to do, and in some ways there's work that DataCoral is doing to make this a bit more streamlined as well. Yeah. So, just to give yet another example of how Greenhouse is helping us move forward on this: one of the things that we have done right now is this whole compile step. Earlier, people would just create these materialized views, and we would automatically infer the dependencies and generate the pipelines. But now, with these DPL files all in one repository, when you're trying to update one of those materialized views, you should be able to run a compile step that will tell you if there's anything downstream that might actually get affected, because we know what the data dependencies are. And again, with Greenhouse leading the way in terms of providing the right use cases, we've been able to get started on that compile step. And then the idea would be that we'd actually provide a CI/CD pipeline where you change that one materialized view somewhere in the middle of the DAG.
And then you should be able to not only push it to production, but be able to run it in a test mode from that node all the way downstream, so that you know what the difference is going to be after the change has actually been applied. So these are things that we are actively working on, and again, with Greenhouse
[00:50:01] Unknown:
helping lead the way in terms of providing the right kind of use cases. And so as you continue to work with DataCoral, and Raghu, as you continue to work with Greenhouse, I'm wondering what you're hoping to see in the future in terms of the platform evolution
[00:50:18] Unknown:
or any plans that you have going forward to add new capabilities or capacity to DataCoral. Yeah. I think that piece we were just talking about is one part of it: the operationalization, the productionizing, if those are words. Basically, making this whole process scale and be discoverable. The data warehouse is becoming its own production system, and with any production system, you want some sort of staged approach to change management; you don't want to just be doing it live. So what we just talked about, the different tools that can help stage a change to the data pipeline and show its impact, that's a big piece of what I'm looking forward to in the future. And on the other end of the spectrum: while SQL is an incredible way to express these data transformations, there are some use cases where things are a little more complicated or you might want to do something a bit more advanced. I think Raghu will probably speak more towards this, but I think building more sophisticated data transformations using the same system will be incredibly valuable.
[00:51:40] Unknown:
Yeah. So, to add to what I had mentioned, one of the things that we are hearing from Greenhouse and other customers is that they'd like to move beyond SQL, to be able to specify more complicated transformations. But we really like the set of abstractions that SQL provides around explicit data dependency specification, as well as the abstraction of saying what you want to get done, not how. So we have come up with this abstraction called the user-defined table generation function. Again, this is not new; query engines like Hive have had it for a long time.
But we have come up with a way where people can actually plug in their Python code to do much more complicated transformations, even things like batch inference. And you should be able to plug that into a data flow, where the data flow specification itself is done in SQL, because that's how we are able to infer the data dependencies that then generate the data pipelines. So this is one of the features that we are super excited about, because it will hopefully allow data scientists to do a lot more than just write SQL.
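(As a rough illustration of the idea only: the function shape, column names, and registration are hypothetical and not DataCoral's actual UDTF interface. A user-defined table generation function is essentially a batch-in, table-out function that a SQL-specified data flow can reference by name.)

```python
import pandas as pd

# Hypothetical sketch of a user-defined table generation function (UDTF):
# it receives a batch of rows from an upstream table and returns a new table,
# which lets you run arbitrary Python (for example, batch inference) inside a
# data flow whose dependencies are still declared in SQL.

def score_candidates(batch: pd.DataFrame) -> pd.DataFrame:
    """Take a batch of candidate rows, return a table of (candidate_id, score)."""
    scored = batch.copy()
    # Stand-in for a real model.predict() call in a batch-inference use case.
    scored["score"] = 0.7 * scored["num_interviews"] + 0.3 * scored["years_experience"]
    return scored[["candidate_id", "score"]]

# The SQL side of the flow would then select from the function's output table,
# e.g. something like: SELECT * FROM score_candidates_output WHERE score > 0.8
```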
[00:52:56] Unknown:
And are there any other aspects of the work that you're doing at Greenhouse, or the work that you're doing at DataCoral, or the interaction between the two companies, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:53:09] Unknown:
Yeah. Actually, one of the things that I think maybe Aaron can talk about is the business requirement. And just to point out here, this was one of the big use cases that Aaron and I had talked about earlier, but we didn't get to it in the big-wins discussion. This is around GDPR. Around early last year, Greenhouse was trying to figure out how they were going to get GDPR compliant on their analytics warehouse. One of the driving factors for the whole logical decoding approach to pulling data was to be able to deal with hard deletes of the data, and that is something we were able to clearly provide on the collect side. But once the data was in the analytics database, we wanted to get to a point where we were proactively anonymizing data, so that it was very easy for Greenhouse to comply with the right to be forgotten. So when this requirement came along from Greenhouse, we worked with them pretty closely to get to a point where even data coming from their APIs, from tools like Salesforce, Zendesk, and Jira, could be anonymized using the same materialized view framework, to allow them to be compliant with the right to be forgotten. Aaron, you want to add to it? Yeah. No, I think
[00:54:31] Unknown:
it was around May last year that this was happening, and we wanted to do everything that we could to be compliant. One of our big worries was, well, if we collect all this data in our data warehouse and don't have an easy way to propagate the deletes, that would be a big exposure point for us. So we wanted to make sure we were able to handle that. As Raghu mentioned, we worked with them on it, and when we first brought this up, it wasn't necessarily top of the roadmap for DataCoral. But as we spoke about it more, it was clear that it would be a big piece for any company that wanted to remain compliant. So we worked together to figure out how to move from the implementation we had to one that would be compliant. We were able to get there, and that was a big win for us. We were able to make our legal team happy by saying that we do comply and do propagate those deletes.
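(As a sketch of what that kind of proactive anonymization can look like: the transform is just SQL that drops or hashes identifying fields before anything lands in the schemas analysts query, re-run on a schedule so upstream hard deletes disappear. The schema, table, and column names below are made up, and this is not DataCoral's actual materialized view framework.)

```python
import psycopg2

# Illustrative anonymizing transform: rebuild a sanitized table so downstream
# consumers never see raw PII. All object names are hypothetical; the SQL is
# plain Postgres/Redshift-compatible CTAS.

ANONYMIZE_CANDIDATES = """
DROP TABLE IF EXISTS analytics.candidates_anonymized;
CREATE TABLE analytics.candidates_anonymized AS
SELECT
    md5(candidate_id::text) AS candidate_key,  -- surrogate key; a salted hash or tokenization is stronger in practice
    department,
    applied_at::date        AS applied_date,   -- coarsen timestamps
    current_stage                              -- direct identifiers (name, email) are simply not selected
FROM raw.candidates;
"""

def refresh_anonymized_table(dsn: str) -> None:
    """Re-run the sanitizing transform; scheduling this is what keeps it proactive."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(ANONYMIZE_CANDIDATES)
```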
[00:55:39] Unknown:
And anything else that we should cover before we close out the show? Yeah. I mean, I just wanted to
[00:55:45] Unknown:
thank Aaron and the Greenhouse team again for the patience they've had as we have grown. They're the kind of customer that any startup could dream of. It's been a really great experience, and we look forward to
[00:56:01] Unknown:
working with them closely going forward. Yeah. The other thing: I mentioned that we worked with Periscope Data as our BI tool, but that was a number of years ago. The current BI tool that we work with is Mode Analytics, and we've been super happy with Mode as the primary window onto our data warehouse. And this is actually an interesting use case that we have. Metabase is an open source BI tool, and DataCoral provides a Metabase slice. Metabase is a pretty powerful BI tool where you can write SQL queries against your data warehouse, your Redshift instance, and see the results in the browser, which is obviously what a BI tool does. But what we discovered in rolling out Metabase is that it was a little rough around the edges for something that would span the entire company. So we decided to keep Metabase as an internal tool for our engineering team and our data scientists to prototype queries, but not to roll it out across the entire company. The other part of that is that, right now, the way Metabase is set up, it has relatively blanket access to our Redshift database. So anyone who has access to Metabase, which is a small subset of people, has access to a lot of data.
That's both a good and a bad thing. It's good because it allows us, again, to prototype some of these queries, but we obviously don't want everyone at the company to have access to every piece of data. So what we've been able to do, again using materialized views, is transform subsets of data into specific schemas in Redshift, and those are the schemas that we give Mode access to. Another piece of this is that we don't transfer PII or other personal information into the schemas to which Mode has access. So that's another layer of security and compliance: we are able to use materialized views to sanitize the data for more public consumption, and by public I mean within Greenhouse, of course.
So that's another interesting use case that's been a pretty big win for us: we can do that and sleep easier at night knowing that we don't have to give everyone access to all the data, which is, I think, probably a worry for a lot of people.
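(The access-control side of that pattern is ordinary warehouse permissioning: the BI tool's database user only gets usage on the sanitized schemas. A minimal sketch follows, assuming a hypothetical "mode_readonly" user and schema names; the GRANT/REVOKE statements are standard Postgres/Redshift syntax.)

```python
import psycopg2

# Sketch of scoping a BI tool's warehouse user to sanitized schemas only.
# User and schema names are placeholders.

GRANTS = [
    # No access at all to the schemas holding raw, unsanitized data.
    "REVOKE ALL ON SCHEMA raw FROM mode_readonly",
    # Read-only access to the curated, PII-free schemas.
    "GRANT USAGE ON SCHEMA analytics_cs TO mode_readonly",
    "GRANT SELECT ON ALL TABLES IN SCHEMA analytics_cs TO mode_readonly",
    # Keep future tables/views in that schema readable as well.
    "ALTER DEFAULT PRIVILEGES IN SCHEMA analytics_cs GRANT SELECT ON TABLES TO mode_readonly",
]

def apply_grants(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for statement in GRANTS:
            cur.execute(statement)
```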
[00:58:47] Unknown:
Yeah. And the fact that Metabase is something that gets deployed inside of your VPC, so it is only accessible through your VPN, adds to that security.
[00:58:57] Unknown:
Exactly. But, again, not everyone has access to our VPN. The customer success team isn't logging on to our engineering VPN. So that was another reason why we didn't roll it out across the company: the technical hurdles were too great. Alright. Well,
[00:59:18] Unknown:
again, thank you both for that. For anybody who wants to follow up with either of you, I'll have you add your preferred contact information to the show notes. And so, as a final question, I'd just like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. Aaron, starting with you. It's a good question.
[00:59:39] Unknown:
Yeah. I think this is actually a piece that maybe I would have touched on before, but we didn't get into it. The thing we're seeing more and more is our customers demanding real-time, or close to real-time, data. The idea of providing that data through data dumps once a day or every hour is giving way to the idea that you could transform and stream data in real time. That's definitely something that people want, but the tooling to get data out of a production system, maintain it, and send it along in real time is still complex. I think we're getting close, but I'd like to see that. And that's something I've talked with Raghu a lot about, and I think DataCoral is thinking about it as well. We're already leveraging logical decoding to pull all the changes that are coming from our Postgres database into our data warehouse.
Are there efficient ways to hook into that to transform it in real time? And, Raghu?
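(For context on that hook: at the Postgres level, logical decoding is exposed through replication slots, and a consumer can tail a slot directly. The sketch below is a hedged illustration; the DSN and slot name are placeholders, it assumes a slot already created with the wal2json output plugin, and a production CDC pipeline needs restart, schema-change, and backpressure handling that this omits.)

```python
import psycopg2
import psycopg2.extras

# Rough sketch of tailing a Postgres logical replication slot.
# Assumes the slot exists, e.g. created once via
# cur.create_replication_slot("analytics_slot", output_plugin="wal2json").

def stream_changes(dsn: str, slot: str = "analytics_slot") -> None:
    conn = psycopg2.connect(
        dsn, connection_factory=psycopg2.extras.LogicalReplicationConnection
    )
    cur = conn.cursor()
    cur.start_replication(slot_name=slot, decode=True)

    def handle(msg):
        # msg.payload is a JSON description of inserts/updates/deletes (wal2json).
        print(msg.payload)
        # Acknowledge progress so the server can recycle WAL behind us.
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

    cur.consume_stream(handle)  # blocks, invoking handle() for each change
```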
[01:00:53] Unknown:
Yeah. From my perspective, the biggest gap is the complexity of the whole tool chain across all the kinds of functionality that are needed for data management. It's still very hard. Even though there are quite a lot of options, it's actually pretty hard for any one company to say, okay, now I know exactly what the right end-to-end toolkit needs to look like. We are doing a little bit to help standardize tooling for end-to-end data flows. But as more and more companies have lots more data and lots more kinds of use cases, I think it's only going to be a problem that keeps increasing: there are more and more options, each one does a small sliver of what you need, and it's up to you to put them all together yourself.
And I think that's
[01:01:51] Unknown:
something that we are trying to put a dent into. Alright. Well, thank you both again for the time today to talk through your experiences. It's definitely valuable to get some insight into the ways that different people are running their engineering teams and managing their data platforms. So I appreciate both of you taking the time today, and I hope you each enjoy the rest of your day. Thank you. Thank you so much,
[01:02:17] Unknown:
Tobias.
Introduction to Guests and Their Backgrounds
Greenhouse's Initial Data Infrastructure
Challenges with Existing ETL Processes
Evaluating and Choosing Data Coral
Onboarding and Transition to Data Coral
Successful Use Cases and Wins
Current Workflow and Future Plans
GDPR Compliance and Data Security
Future of Data Management Tools