Replatforming Production Dataflows - Episode 116

Summary

Building a reliable data platform is a never-ending task. Even if you have a process that works for you and your business, there can be unexpected events that require a change in your platform architecture. In this episode the head of data for Mayvenn shares their experience migrating an existing set of streaming workflows onto the Ascend platform after their previous vendor was acquired and changed their offering. This is an interesting discussion about the ongoing maintenance and decision making required to keep your business data up to date and accurate.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Sheel Choksi and Sean Knapp about Mayvenn’s experience migrating their dataflows onto the Ascend platform

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start off by describing what Mayvenn is and give a sense of how you are using data?
  • What are the sources of data that you are working with?
  • What are the biggest challenges you are facing in collecting, processing, and analyzing your data?
  • Before adopting Ascend, what did your overall platform for data management look like?
  • What were the pain points that you were facing which led you to seek a new solution?
    • What were the selection criteria that you set forth for addressing your needs at the time?
    • What were the aspects of Ascend which were most appealing?
  • What are some of the edge cases that you have dealt with in the Ascend platform?
  • Now that you have been using Ascend for a while, what components of your previous architecture have you been able to retire?
  • Can you talk through the migration process of incorporating Ascend into your platform and any validation that you used to ensure that your data operations remained accurate and consistent?
  • How has the migration to Ascend impacted your overall capacity for processing data or integrating new sources into your analytics?
  • What are your future plans for how to use data across your organization?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:14
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Sheel Choksi and Sean Knapp about Mayvenn's experience migrating their dataflows onto the Ascend platform. So Sheel, can you start by introducing yourself?
Sheel Choksi
0:01:48
Sure. So my name is Sheel, Senior Director of Operations here at Mayvenn, responsible for product engineering and our data analytics teams. I've been at Mayvenn for about five and a half years now, so I'm pretty familiar with Mayvenn, and happy to talk about our journey to Ascend.
Tobias Macey
0:02:06
And Sean, you've been on previously to talk about your experience building Ascend, but if you can introduce yourself again for anyone who hasn't listened to that episode.
Sean Knapp
0:02:13
I have, and I'm really excited to be back. I'm the founder and CEO here at Ascend. We're a roughly four and a half year old company, and I'm happy to chat more about what we're doing, but frankly, even more excited about everything that Mayvenn's been doing.
Tobias Macey
0:02:31
And so, going back to you, Sheel, do you remember how you first got involved in the area of data management?
Sheel Choksi
0:02:36
Yeah, sure. So, you know, I think, like most startups, we started with the engineering team. And at that point, even early on, we realized that at Mayvenn we wanted to collect a lot of this data that a lot of companies might deem transient. So, you know, Mayvenn is a half ecommerce company and half sort of gig economy and service matchmaking company. And so an example from that might be your traditional order table, you know, with the line items in an order. Even back then we realized that, you know, rather than just knowing the current state of the line items in the order, we'd rather know which items were added and removed, and when they were added and removed, to sort of get a better sense of what our customers were really up to. So even back then, when we were just an engineering team, we started sort of managing our data from a standpoint of: if we can at least hold on to everything, and, you know, the way storage prices keep getting cheaper and cheaper, we'll be ready as the company keeps expanding. As the company did in fact keep expanding, that's when we started bringing on an actual dedicated data team. And so that was one of the teams that the company asked me to build. So that's kind of how I got involved, a little bit outside of engineering, but also kind of that transition period from, you know, the engineering team's collecting of data to our actual data team's usage of the data.
Tobias Macey
0:03:54
And Sean, can you share your experience of how you got started with data as well?
Sean Knapp
0:03:58
My first experience with data was pretty quickly out of college back in 2004, which feels like a really long time ago. I was a front end software engineer at Google, and we ran a lot of experiments on web search. I think, you know, similar to what Sheel had described, we collected a ton of data to really analyze how users were engaging with web search. And so as we were running a large number of experiments on web search, and I think there were over 100 per year that we were running users through, we were doing large MapReduce jobs to analyze the results of those experiments. So I started writing a lot of MapReduce jobs in the internal language called Sawzall back then to do metrics and track the efficacy of our various experiments.
Tobias Macey
0:04:43
And Sheel, you mentioned that Mayvenn is a combination of ecommerce platform and gig economy marketplace. I'm wondering if you can just give a bit more of a description about what the company is and some of the types of data that you're dealing with and some of the ways that you're using it.
Sheel Choksi
0:04:57
Yeah, definitely. So, you know, Mayvenn has been around for a little while, and the model that I described is a newer one that we started about a year and a half ago. The products that we primarily sell at Mayvenn are hair extensions, so clip-ins, tape-ins, wigs, and your traditional sew-in type bundled hair. All of these products typically require professional stylists to help you install them after you purchase them. So what Mayvenn does is we combine those two sides of it together. So what you can do is you can come to the Mayvenn website and purchase the hair, that's a traditional ecommerce experience, and then after you finish your purchase we'll matchmake you with one of our service providers. We've got thousands of stylists spread out all across the United States who will actually do the service for you at a discounted price. So, you know, the first half of that is sort of ecommerce, and the second half of that is what we look at as sort of a gig economy with these independently contracted stylists. And I think the second part of your question there was sort of what we're up to in terms of data collection. So on the ecommerce side, it's a lot of the standard ecommerce things that you might expect. On the back end, it's orders and shipments and fulfillment and keeping track of fraud and things like this. And then on the front end, it's a lot of clickstream data. So, of course, page views, add-to-carts, email captures, everything that we need to understand the customers. And then I think where the data gets more interesting is marrying all of that together with the services side of what we're up to. So that includes things like stylist directories and how we surface the best stylists, collecting more ratings and reviews for our stylists, getting that feedback, and then just a lot of operational things as well, so stylists can get sort of paid out on time and all of that. So there are sort of three levels of data groupings: the consumer, the stylist, and then internal operations.
Tobias Macey
0:06:44
And in terms of the biggest challenges that you're facing in collecting that data and then processing it and gaining value from it, what were some of the most notable ones, particularly as you first started to build out that data organization within Mayvenn and trying to get your head around what it is that you had and what you were trying to do with it?
Sheel Choksi
0:07:05
Yeah, definitely. So I'd say it's actually similar to most of Mayvenn's problems, being, you know, kind of a startup, in that most of the problems tend to be ones of surface area. So not necessarily super deep into dealing with, you know, petabytes or exabytes of data, but more just dealing with the wide variety of it. You know, as I mentioned, clickstream data, you know, that has its own unique properties of maybe some geo-IP or user agent parsing. When we're thinking of sort of the backend data, it all has its own schemas. And then, you know, since obviously we're not building everything ourselves, you know, I think last I checked there are probably more than 20 external vendors who we use that also have their own data. And so figuring out, basically, okay, I've got all these different rich sources of data, how do we sort of make that into a meaningful, analytics-ready kind of version of that data? That's really, I think, where our problems are.
Tobias Macey
0:08:03
And before you started migrating onto the Ascend platform, can you give a bit of an overview of the overall infrastructure and sort of systems components and some of the vendors that you are working with to try and be able to wrangle the data and get some valuable insights from it?
Sheel Choksi
0:08:18
Yeah, sure. So internally, in our Mayvenn sort of engineering systems, everything is modeled as basically a giant event driven application. So we use Kafka pretty heavily for that in terms of events going through, and then applications can react to events. So a simple example might be, you know, an order placed event that a fulfillment provider might react to and then, you know, decide to fulfill that order. So internally we've always kind of had that, which sort of leads to almost an extreme convenience in just kind of dumping more and more events in, you know, whether that means a vendor who integrates through webhooks, or, you know, just fetching some API report and then translating those into events. It sort of allows for bringing in more events very quickly, which is great on the ingest side, but I think on the, kind of what I was referring to as, the analytics-ready side, that's where I think you kind of have to deal with this variety of data and all sorts of different schemas and all of that. And so in our original architecture, after it all kind of went through Kafka, that's when we used this vendor called Alooma, who I think got acquired by Google last year. And they helped us sort of detangle all of these events and decide which ones should go to which tables. We use Redshift mainly for analytics, so mapping those to Redshift tables, whether that's one-to-one, one-to-many, or many-to-one, in kind of a light, very light transformation, I'd say. So it was definitely kind of an ELT shop in that sense. And, you know, that was kind of it. The goal was really just: okay, we've got all this data flowing, can we just get it mapped to Redshift so that we can then do kind of the rest of our transforms and then start building up the analytical tables that we want, both for ad hoc analysis, or for our data scientists to use, or for reporting, that sort of thing.
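To make that event-to-table mapping a bit more concrete, here is a minimal Python sketch of the kind of light transformation described: fanning a single order event out into a header row (one-to-one) and line-item rows (one-to-many) that are ready to load into warehouse tables. The event name and fields are hypothetical, not Mayvenn's actual schema, and this is not Alooma's transform API, just the general shape of the mapping.

```python
# A minimal sketch of a "light transformation" from an event stream to
# warehouse-shaped rows. Event names and fields are illustrative only.
import json


def transform_order_event(raw_event: str):
    """Map one hypothetical order event to an orders row plus line_items rows."""
    event = json.loads(raw_event)

    # One-to-one mapping: the order header becomes a single row.
    order_row = {
        "order_id": event["order_id"],
        "placed_at": event["placed_at"],
        "customer_id": event["customer_id"],
        "total_cents": event["total_cents"],
    }

    # One-to-many mapping: each line item becomes its own row.
    line_item_rows = [
        {
            "order_id": event["order_id"],
            "sku": item["sku"],
            "quantity": item["quantity"],
            "action": item.get("action", "added"),  # e.g. added / removed
        }
        for item in event.get("line_items", [])
    ]
    return order_row, line_item_rows


if __name__ == "__main__":
    sample = json.dumps({
        "order_id": "o-123",
        "placed_at": "2020-01-15T12:00:00Z",
        "customer_id": "c-456",
        "total_cents": 15999,
        "line_items": [{"sku": "clip-in-18in", "quantity": 1}],
    })
    print(transform_order_event(sample))
```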
Tobias Macey
0:10:02
And were you using anything like DBT for being able to manage the transformations of the data once it landed in redshift, and being able to have some sort of versioning and source control over how you're actually manipulating it from the point of landing it in your data warehouse and whatever other data storage systems you have for and then through the different views that you're trying to get across it?
Sheel Choksi
0:10:24
Yeah, so we actually ended up going super lightweight there. So we started with Amazon's Data Pipeline, and sort of using it, you know, it had like Redshift activity nodes and single transform nodes and that sort of thing. So we started super lightweight, and just kind of manually managed the actual source files in GitHub. So I'd say it was pretty sticks and stones, kind of trying to assemble that together. You know, that was one of the things that we sort of noticed, was that because the initial tool we were using, Alooma, is more of a stream-based tool and more set up for ELT, basically we were kind of on our own for that last transform. That's one of the pain points that we wanted to fix going forward.
Tobias Macey
0:11:04
And so that brings us to the point of deciding that you were at a tipping point where you didn't necessarily want to keep duct taping together a number of different solutions to be able to try and gain some insights from the data that you do have. And I'm wondering what the final straw was that actually led you to decide that you needed to do a major re architecture and what the criteria were for deciding what the components were or what the new direction needed to be for being able to take all the different types of data that you were working with.
Sheel Choksi
0:11:38
Yeah, sure. So the final straw was, well, not as intentional from the Mayvenn side as I wish I could say it was, but basically, after Alooma got acquired by Google, they were sort of ending Redshift support for us, and that kind of left us without really a vendor in place. But what I will say is, even outside of that final straw, one of the other big pain points that was happening is, because it was sort of a stream-based transform, you either kind of transform it at that moment, or you kind of let that data go. You know, obviously there are restreams and that sort of thing, but it wasn't super easy to do. And so it led us to this kind of hoarder mentality, in that, you know, transform everything, keep everything, because this is your one shot, whether or not it's actually useful, whether or not anybody even understands what this column means or what this event means. Let's just hold on to it, let's just get it in Redshift so that we have it. Which sort of inevitably led us to this extremely bloated Alooma and Redshift, with tons of tables and tons of columns that didn't necessarily have a lot of business value, and added to just kind of clutter and confusion. You know, when we were adding new people onto the team, there was just this enormous ramp up of, like, what is all of this stuff? What do I actually need to use? Should I use this table or that table? They both look similar. And so I would say that's the other major one that we were kind of looking to solve once we realized that we didn't have a vendor anymore. And so that kind of helped frame our perspective of what we were looking for in the search. You know, as we kind of started looking through vendors, we found that at a rough level we could kind of group them into three categories. One we kind of called the stream-based category. So that was sort of the Aloomas; I think there's a similar one called Hevo Data these days. We kind of bucketed those all into the stream-based vendors. Then we found kind of the more declarative vendors, which is what we kind of consider Ascend, which is sort of, you know, more describe what it is that you want, and not necessarily so imperative as, like, we're going to take this event, we're going to split it this way and that way. And so, you know, some others that we found that we bucketed in that category might be like an Upsolver, folks like that. And then the third category that we saw was more of these, I shouldn't say legacy, but, you know, maybe some of these older generation ones, you know, like the Pentahos, and then the very opinionated ones, you know, maybe like a Stitch Data or a Fivetran, where, you know, they're very opinionated and sort of inform how exactly this is all going to go. And, you know, we were used to the flexibility that Alooma provided us, you know, they allow you to write kind of arbitrary Python. And so we knew we didn't want to go in the direction of one that didn't allow us to be very flexible, you know, such as, like, a Fivetran or that sort of thing. And so that sort of left us with, you know, we could do a one-to-one stream-based migration, or, you know, we could take a look at some of these newer providers and what they're up to.

And as soon as we started looking at these newer providers, Ascend obviously being our final choice there, we immediately sort of just saw that value add, and how I summarize that value add is that it sort of made change cheap. As I mentioned, in Alooma, if all of a sudden we didn't map a column, and, you know, we needed to go back and get that column because it was important, you know, that was just really expensive. You had to wait a while, you had to go figure out how you were going to go get the old data again, you had to wait for it to slowly, one event at a time, kind of restream in, and you had to deal with the deduplication problem at the end. You know, it was just sort of a long series of piecemeal steps that made change expensive, right? But as we started looking at these newer providers, that's what we saw as the value add: we didn't need to kind of hoard all the data exactly as we needed it in Redshift, but rather, we could treat Redshift more as it was meant to be used. Let's prep these tables for how we want them, let's bring in the columns that have meaning to us, and let's not fear that, you know, maybe we made a mistake, or more importantly, you know, data will change, schemas will change, let's be able to adapt.
Tobias Macey
0:15:26
So, Sean, from your perspective, I'm curious what your experience was, and at what stage you were at when Sheel first came to you and wanted to trial the Ascend platform with the workload that he had at Mayvenn, and any aspects of the types of data that he was working with, or the variety of data, that posed any sort of challenge for the state of the Ascend platform at the time, and how that may have helped you in terms of determining your product direction?
Sean Knapp
0:15:54
Yes, really good question. And, you know, I think a couple of the things that Sheel mentioned really resonated really quickly with us when we saw a lot of their use case. Well, I think the first was when they were notified that they had to find another solution and were being forced into this pattern and this reevaluation very quickly, which we, of course, love, because it highlights the need and the requirement that data teams have to move quickly. And so that was actually, I think, and Sheel, you should correct me if I'm wrong, but I think it was something like a week or two from when you had first read about us all the way to, like, where you'd gone into the free trial, built some stuff, contacted the sales team, and had essentially signed a contract with us, which is super cool, but highlighted how fast you have to be able to move in sort of the modern era. And I think the other thing about their use case that really resonated a lot was around this need for flexibility, right? You have all these really interesting data sets coming in, the variety is incredibly high, and Sheel hit, I think, the nail on the head, which was, you don't know what you're going to need yet, right? So you start to get this hoarder mindset of, I'm going to go and keep everything, but you're trying to find that balance of, you don't want to go jam it all into your data warehouse, because that's an anti-pattern and just abuses your data warehouse and clogs it up, but you need that flexibility to go create new derived insights and derived data sets on top of other data sets down the road. And so when we started to see a lot of those patterns emerge, one, it's one of the things that many of the people at Ascend had seen earlier on in our careers, which really drove us to create Ascend, and two, we think it's a pattern that a lot of teams encounter. Now, as far as the other part of your question, as far as product extensions or new features that we had built, I'm sure there's something that we may have needed to add on top. Sheel, you could probably answer even better than I could.
Sheel Choksi
0:18:03
I'm sure there were a couple. Maybe this will help jog your memory: one, I think, was figuring out how we were going to do exactly this geo-IP parsing kind of in a declarative, batch-based world. It's a lot easier, obviously, to think about in a stream, but a little trickier to think about and be efficient with in a batch. That was one. And I think that sort of extended itself into sort of joining partitions, so if you want to talk about any of that.
Sean Knapp
0:18:25
Oh, yeah, thank you so much. That was one of the really cool, exciting ones that we did add. So part of this is, in a declarative model, you're really defining the end state, right? You're not saying, hey, run this block of code every day at this time. And it's really interesting when you combine something like that with IP parsing, because the assignment of an IP address to a particular geography is a time specific thing. And so one of the features that we added was this ability to do really optimized joins of data that are time specific. In a declarative model, if you change code or if you change the definition, the system itself will start to recalculate those definitions based off of the new model. But one of the things that you don't want to do is recalculate stuff that is based off of a snapshot in time from yesterday, or a week before, or even a month or a year before. And so one of the things that our team obviously had a lot of fun going in and implementing was this deeper level of time awareness, so you could do these really advanced joins across different data sets, and different partitions of data in those data sets, that are more time aware and sensitive to that.
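For a concrete picture of that kind of time-aware join, here is a minimal PySpark sketch, assuming a clickstream DataFrame and an IP-to-geography mapping table with validity windows. The column names and sample values are hypothetical, and this illustrates the general technique rather than Ascend's internal implementation.

```python
# A sketch of a time-aware geo-IP join: events are matched against the
# IP-to-geography mapping that was valid when the event happened, so
# reprocessing old partitions does not apply today's mapping to old data.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("time-aware-geoip").getOrCreate()

events = spark.createDataFrame(
    [("203.0.113.7", "2020-01-10T08:00:00"), ("203.0.113.7", "2019-06-01T08:00:00")],
    ["ip", "event_time"],
).withColumn("event_time", F.to_timestamp("event_time"))

# Each mapping row is only valid between effective_from and effective_to.
ip_geo = spark.createDataFrame(
    [
        ("203.0.113.7", "Wisconsin", "2019-01-01T00:00:00", "2019-12-31T23:59:59"),
        ("203.0.113.7", "New York", "2020-01-01T00:00:00", "2020-12-31T23:59:59"),
    ],
    ["ip", "region", "effective_from", "effective_to"],
).withColumn("effective_from", F.to_timestamp("effective_from")) \
 .withColumn("effective_to", F.to_timestamp("effective_to"))

# Join on IP *and* on the event timestamp falling inside the validity window.
enriched = events.join(
    ip_geo,
    (events.ip == ip_geo.ip)
    & (events.event_time >= ip_geo.effective_from)
    & (events.event_time <= ip_geo.effective_to),
    "left",
).select(events.ip, "event_time", "region")

enriched.show()
```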
Tobias Macey
0:19:36
Yeah, that's definitely an interesting challenge, being able to reprocess your data in a batch manner for data that, as you said, is subject to change at any particular point in time, and geo-IP being an interesting case of that, where a lot of people might think about it as being somewhat static, but particularly for IPv4, it's something that gets reassigned depending on who's bidding on those IP blocks. And so what today might point to Wisconsin, tomorrow might point to New York, because that's just how maybe Amazon allocates them, for instance. So that's definitely an interesting approach and something that I hadn't really thought of before.
Sean Knapp
0:20:10
Yeah, it's definitely one of the ones that, in the imperative model, that sort of use pattern works well for. And so that was the fun part in the declarative model: creating this awareness. Right, in a declarative system, the underlying control plane is the thing that's dynamically generating activity and work for you, and so you're constantly adding more intelligence into that control plane, similar to how you would a query optimizer or a query planner for a database. Putting these continued optimizations into that engine, so it can get smarter and smarter based off of the declarative model, the blueprint that you create, is an area of really interesting continued research and development, frankly.
Tobias Macey
0:20:54
And Sheel, I'm curious if there were any other edge cases that you ran into as you were migrating onto it, that were easy in that stream-based approach but became either difficult or impractical, or that you just needed to think about a different way of approaching in this declarative model.
Sheel Choksi
0:21:10
So I think the other one is sort of taking the latest version of something. I think if you have truly just an event stream, whether it's imperative or declarative, that's sort of not an issue, because, you know, maybe they're all append only. But when we start getting into actual resources, you know, maybe users, and just kind of wanting the latest state of the user, that's definitely a little bit of a different thing to think about in imperative versus declarative. In imperative, it was as simple as, you know, just kind of keep appending the rows and then, as we deduplicate data, just take the latest one, whereas in a declarative approach, it's much more of how you would think about it in SQL, really, which is, you know, let's row-number them across and get the latest version by occurred-at or something like that. It feels a little bit different, and it feels maybe a little bit more complex, but I think, under the hood, it's kind of the same idea. Regardless, it's just sort of: are you doing patch-set updates, or are you kind of redoing a bunch? And I think that is one of the things that, at least on my team here, took a little bit of our time wrapping our heads around, is that Ascend finds this happy medium in between redoing everything to find the latest and, you know, just that patch set between that very last event and the previous one. It's sort of how it treats partitions. It's really quite elegant the way it's done, but it took our team just a little bit of a mental swap in switching approaches to understand how that works.
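As a concrete illustration of that declarative "latest version" pattern, here is a short PySpark sketch using a window function over an append-only stream of user snapshots. The column names (user_id, occurred_at, email) are illustrative assumptions, not Mayvenn's actual schema.

```python
# Declarative dedup: keep only the newest row per user by numbering each
# user's rows from newest to oldest and keeping row 1.
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("latest-version").getOrCreate()

user_events = spark.createDataFrame(
    [
        ("u-1", "2020-01-01T00:00:00", "old@example.com"),
        ("u-1", "2020-02-01T00:00:00", "new@example.com"),
        ("u-2", "2020-01-15T00:00:00", "other@example.com"),
    ],
    ["user_id", "occurred_at", "email"],
).withColumn("occurred_at", F.to_timestamp("occurred_at"))

w = Window.partitionBy("user_id").orderBy(F.col("occurred_at").desc())
latest_users = (
    user_events
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

latest_users.show()
```

The equivalent SQL would use ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY occurred_at DESC) and filter to the first row, which is the "row number them across" approach Sheel mentions.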
Tobias Macey
0:22:31
And obviously you ended up having to retire the dependency on Alooma, because that's what pushed you onto Ascend. But now that you've been using the Ascend platform for a little while, I'm curious if there are any other aspects of your data infrastructure and overall architecture for your data platform that you have been able to rethink or reimplement or even completely retire as a result?
Sheel Choksi
0:22:54
Yeah, quite a few. So, you know, as we kind of talked about, we had kind of patched a bunch of different things together. So, you know, outside of the event stream, where most of our data comes from, we have a decent number of basically API fetchers, if you will, and they just kind of need to go to, like, the Facebook API, grab our latest ads data, or, you know, go to our reviews platform and grab the latest reviews, that sort of thing. And so we were sort of managing that ad hoc in Data Pipeline, so, you know, kind of some custom coded scripts that run in Data Pipeline and fetch that data, and that's one that we've been able to retire, and thankfully so. Ascend has sort of custom connectors, and so you can write Python scripts that are both run and managed by Ascend, and so we were able to convert those all into that. So that's gone. Then on the actual Alooma side, we covered that. And then from when the data goes into Redshift, you know, as we talked about, we had a bunch of additional Data Pipeline jobs that, you know, might retransform that data using Redshift, and so we've been able to start extracting those into our actual ETL pipeline. So, you know, instead of it being more of a pure ELT, it's a little bit more of an actual ETL now, and so we don't necessarily need to run these expensive scripts once a day in Redshift, which has led to two things. One is that, well, you know, Redshift doesn't randomly spike once a day. And the second is that the frequency at which the data is available, to both our data team as well as everybody else within Mayvenn, has actually gone up, you know, because we can do this on sort of a higher frequency. And so people are getting fresher data, people are reacting a little bit faster to, you know, maybe an ad that's not performing as well as it used to, and that sort of thing. So we've been able to take out all of those components with this migration.
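To give a feel for what those API fetchers do, here is a generic Python sketch of a paginated fetch from a hypothetical reviews endpoint. This is not Ascend's actual custom-connector interface, and the URL, parameters, and response fields are made up for illustration; it only shows the kind of fetch logic such a connector would wrap.

```python
# A generic, illustrative API fetcher: pull paginated records from a
# hypothetical reviews endpoint and yield them as plain dictionaries.
import requests

REVIEWS_URL = "https://api.example-reviews.com/v1/reviews"  # hypothetical endpoint


def fetch_reviews(api_key: str, updated_since: str):
    """Yield review records updated since the given ISO-8601 timestamp."""
    page = 1
    while True:
        resp = requests.get(
            REVIEWS_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            params={"updated_since": updated_since, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        records = resp.json().get("reviews", [])
        if not records:
            break
        for record in records:
            yield record
        page += 1


if __name__ == "__main__":
    for review in fetch_reviews(api_key="YOUR_API_KEY", updated_since="2020-01-01T00:00:00Z"):
        print(review["id"], review.get("rating"))
```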
Tobias Macey
0:24:35
And I'm wondering what the overall process was for making that migration and what the timeline was that you had available for being able to cut things over. And then finally, as part of the overall migration, I'm wondering what steps you took to perform any sorts of validation or testing to ensure that the results that you were getting before and after were still consistent, and that you were able to ensure that you either weren't missing data or that the transforms weren't hitting some sort of bug that was causing a variance in the overall results of the analysis that you're performing.
Sheel Choksi
0:25:10
Yeah, sure. So we dedicated, you know, quite a bit of resources in order to make sure that this migration was successful. So, you know, our data team is four people, and we pretty much used about two thirds of everyone's time, full time, for about four to six weeks, I would say, to cut over. And, you know, I think that includes not only just sort of the one-to-one mapping of Alooma, but these additional transforms that I was talking about, that were being performed elsewhere, sort of moved into Ascend as well. So that's kind of what the project timeline and bandwidth looked like. You know, as we keep going, of course, there are just additional resources required as new events come in and that sort of thing, but it's going a lot quicker than Alooma was. And then I think the second part of your question there was sort of, you know, in the migration, how do we know that we're ready to go live and that the data is accurate. So, you know, I think in any migration, we tried to follow some best practices where we tried to keep most of the kind of base tables in a one-to-one style so that we could more easily audit. So instead of trying to bite off more than we could chew in one transform, given such a powerful tool, we still kept it fairly basic. We didn't really get too fancy with doing too many joins or row operations or anything like that that we weren't doing in Alooma, just because we could. So that helped with the QA and the validation quite a bit, since it was much more of an exercise in, you know, I have this many rows over here, I should have this many over here; you know, these columns kind of look like this, they should look like this. And so it was a lot of that basic kind of profiling of our data and making sure that that profile still matched, you know, important things like making sure that the revenue for a particular month was still the same revenue, that kind of stuff. So that helped us quite a bit, and then you got into the sort of nitty gritty of what you mentioned, of, like, you know, maybe particular transformations and subtle sort of logic errors. And, you know, I think one of the things I like about Ascend, and kind of Spark which underlies it, is that, you know, in general, if there is an error, it tries to push it down as far as possible and keep things flowing, you know. And that makes it a lot easier to audit and verify, because then instead of dealing with, like, mysterious errors in your ETL pipeline, of, like, you know, line 46 cannot call this function of null, and trying to figure out which row that might possibly relate to in your massive data set of 10 billion rows, you know, you actually have it all in your database already. It's not perfect, but it's somewhere that it's queryable. And so that was really nice in this as well, is that, you know, at the end of the day, what data folks are most familiar with is querying data.

And so even if the ETL isn't perfect, at least the fact that it made it all the way through and is queryable very quickly let us sort of diagnose any issues, figure out which rows were having that problem, you know, kind of just using queries, and then fix that up very easily: just change that script in Ascend, let Ascend rerun it, sort of automatically if you will, and then just check it again in our Redshift. Which gets at a different feature that Ascend has as well, which is that at each step of the transformation, Ascend lets you write SQL queries to kind of query your data. That turned out to be hugely valuable in this validation process as well, because not only could we query Redshift to see what went wrong, but we could then kind of rerun, like, a modified version of that query at each transform node, and then really track down where it went wrong, you know, especially in a more complex transformation with several steps.
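As a sketch of the kind of profiling checks described here, matching row counts and monthly revenue between the old and new load of a table, here is a small PySpark example. The table shapes and column names are illustrative assumptions rather than Mayvenn's actual warehouse schema.

```python
# Compare row counts and monthly revenue totals between the table loaded by
# the old pipeline and the one loaded through the new pipeline.
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("migration-validation").getOrCreate()


def monthly_revenue(df: DataFrame) -> DataFrame:
    """Sum revenue per calendar month, assuming placed_at and total_cents columns."""
    return (
        df.withColumn("month", F.date_trunc("month", F.col("placed_at")))
          .groupBy("month")
          .agg(F.sum("total_cents").alias("revenue_cents"))
    )


def compare_profiles(old_orders: DataFrame, new_orders: DataFrame) -> None:
    # Check 1: row counts should match.
    old_count, new_count = old_orders.count(), new_orders.count()
    print(f"row counts: old={old_count} new={new_count} match={old_count == new_count}")

    # Check 2: revenue per month should be identical in both tables.
    mismatches = (
        monthly_revenue(old_orders).alias("old")
        .join(monthly_revenue(new_orders).alias("new"), "month", "full_outer")
        .where(~F.col("old.revenue_cents").eqNullSafe(F.col("new.revenue_cents")))
    )
    print(f"months with mismatched revenue: {mismatches.count()}")


if __name__ == "__main__":
    rows = [("o-1", "2020-01-05T00:00:00", 15999), ("o-2", "2020-02-10T00:00:00", 8999)]
    cols = ["order_id", "placed_at", "total_cents"]
    old_df = spark.createDataFrame(rows, cols).withColumn("placed_at", F.to_timestamp("placed_at"))
    new_df = spark.createDataFrame(rows, cols).withColumn("placed_at", F.to_timestamp("placed_at"))
    compare_profiles(old_df, new_df)
```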
Tobias Macey
0:28:22
And now that you have been using Ascend in production for a while, you've been able to replace some of your pre-existing infrastructure and optimize some of the overall query patterns and ETL jobs that you're building. I'm wondering how you have been able to use the additional capacity that you have, both in terms of compute infrastructure and scalability, but also in terms of human capacity, for being able to either onboard new data sources to gain new insights from those, or take new looks at the data that you do have, or integrate them into your overall analytics.
Sheel Choksi
0:28:54
Yeah, definitely. So I would say time for change, like either the amount of time it took to get a new data source in or update an existing one, has sort of gone down from, what was it, one or two days' worth of work, down to honestly less than an hour, where someone can very quickly kind of propose a change, somebody else can, you know, take a look at it, and by the time it goes live and Ascend sort of scales up and replays everything, in an hour or two that data is already kind of mapped to the database. So that has been markedly different. I would say we used to spend one fourth of the team's time just trying to keep up with all the new data sources and changing schemas of given events and that sort of thing, and that's decreased to, oh gosh, I'd say pretty much negligible at this point. Because, you know, again, Alooma is sort of the hoarder's version, right? And so you had to keep up with every single schema change, every single event change, whereas now it's sort of more pull based. So if you want new columns, you can go explore, you can go find them, and you can go bring them in very quickly. But if you're pretty happy with the way things are, you don't have to keep up with every single schema change. So that has added up to a huge amount of human savings within Mayvenn, which then just translates to, you know, more of the actual projects that we need to be doing at Mayvenn. You know, Mayvenn is obviously not a data infrastructure company, and so that's just not where we want to be spending our time. We want to be spending our time, you know, finding the best stylists and matching them with our customers. So, you know, that's what that all translates to.
Tobias Macey
0:30:21
And to your point about the time from identifying a data source to getting it active in the platform, I'm curious what the difference in the collaboration style or the overall workflow looks like between where you were and where you are now.
Sheel Choksi
0:30:36
So, I mean, I think in that sense it's similar; you know, both of them are code based. So in Alooma you're writing Python code, and in Ascend you're sort of either writing PySpark or SQL code. So, you know, given that it's code, obviously we've always had somebody review any of those changes, so that part sort of remains the same. What I will say, though, is our data team is made up of, you know, an analytics manager, data scientists, you know, people just starting to learn; you know, kind of different people are in different places, and Ascend has allowed us to democratize more of that process. So when it was all written in Python in Alooma, you know, the number of people who could actually review a change that was going into the stream was quite few. It was somebody who had to understand both Python unit testing and Python, you know, kind of both of those, in order to be able to look at it, and then also have a good understanding of the schema and everything that might be related to it. In the Ascend version of this, though, because it's SQL or PySpark, and, you know, a lot of PySpark looks and feels like SQL abstractions, basically our whole team is now able to participate in this process that used to be just a couple of folks. And so that has actually been a better collaboration style.
Tobias Macey
0:31:46
And are there any other aspects of your experience of migrating the overall data platform, either anything specifically about Ascend or the process of identifying new platforms or rearchitecting your systems, or, from your side, Sean, anything that we didn't discuss from your end of bringing Mayvenn on board and working with them to identify the optimal way to take advantage of your platform? Anything like that that we didn't discuss yet that either of you would like to cover before we close out the show?
Sean Knapp
0:32:14
Well, so, you know, one of the things that I would certainly double down on that Sheel's highlighted is, you know, I think the new metric for really progressive data engineering teams is no longer how much data can you ingest, or how fast can you process it, or how much of it can you store. I think it's really changing tune now to: how fast can you move, how many new insights, how many new data products can you produce? And frankly, a lot of that comes down to how can you really minimize and optimize the maintenance burden on teams, so that you can really focus the lion's share of your efforts on creating new value. And I think that's what's really exciting in working with Sheel and the team; they have just been laser focused on that, and I think totally embody that new focus from a leadership perspective of how do we just enable teams to go faster?
Sheel Choksi
0:33:06
Yeah, the only thing I would add is sort of a tangent from that, but something that our engineering team has actually brought up since they've been working with Ascend a little bit more is that, you know, I think there's a joke in engineering that so many problems are really just taking some data, transforming it, and pushing it somewhere else, whether that's through an API or a batch transfer or, you know, some SFTP type situation. And so when you really take, like, a bird's eye view of Ascend, there are a lot of problems that would have to be solved with kind of custom code that can be done really well in a platform as flexible as this. And so we haven't gotten there yet, but the engineering team has a lot of interest in using the platform for, you know, use cases outside of maybe this ETL. So an example is, you know, maybe with one of our vendors we need to upload, you know, maybe some basic order information over the past month in order for them to do some reconciliation. You know, of course you could write an app to do that, but then you take a step back and you're like, well, this ETL pipeline is already doing all of that work. Couldn't we just do another transform? Couldn't we just do another write output that could handle all of this? And so, you know, I don't know if this is necessarily something that Ascend finds interesting, but it's something that a lot of people within Mayvenn are talking about as a way to structure things a little differently.
Tobias Macey
0:34:17
Yeah, that's definitely interesting, seeing some of the ways that the lines are blurring between data infrastructure and data engineering and data science and application infrastructure and architecture, and how the tools are becoming flexible enough that they're actually starting to bridge the divide, where they can be used in all of these different contexts so that you can leverage the underlying platform without having to have these bright lines that separate what is being used for which aspects of the business. Exactly. All right. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, starting with you, Sheel, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Sheel Choksi
0:35:00
It's interesting; I think we're still in a place of so many tools. I think what's happening in the data industry, you know, kind of reminds me of JavaScript and front end development. There are a lot of problems that are very quickly being solved, which sort of leads you to: okay, first we'll grab the data from here, then maybe it goes through Kafka, then maybe it, you know, goes through this ETL vendor, then maybe Redshift. Oh no, our Redshift is getting full, let's offload that to, you know, S3. Oh wait, but we want to query S3, let's bring that back in with Athena or Redshift Spectrum and, you know, tie that together. Oh, but we want to extract that and then run data science on it, so let's run EMR jobs on it. And, you know, I think where we're at is it's just a massive proliferation of all of these tools, and it's a lot to keep up with in the data space in terms of maybe what somebody is up to. And so I think these higher level orchestration tools and higher level coordinations that, you know, start abstracting away these details of exactly how things are done and which exact technology you're using there, I think those are going to become valuable, to allow people to be more productive and not necessarily have to be experts at every single tool.
Tobias Macey
0:36:06
And Sean, do you have anything to add as far as your perspective on the biggest gap that you see in the tooling or technology that's available for data management today?
Sean Knapp
0:36:14
Yeah, I'd say I'd certainly reinforce and build on top of what Sheel is saying, which is, I think we have a lot of really interesting technologies that are designed for point solutions, and they make it really easy to prove out a concept or build your first pipeline or write your first query. But the tools, I think, are really geared today for how do you do a great hello world and the first use case, and I think there's been less time spent on how do we make sure that these systems are an incredibly low maintenance burden. So much so that I think it's a bit of a badge of honor to say you're a data engineer, because it's like, oh yeah, you have to go deal with all these things that break all the time, and it creates a certain bond among all of us. But I do believe that, similar to what we've seen in other domains and technology spaces, like infrastructure, for example, there is the ability to build these higher abstraction layers that, as a result, have a better understanding of all the underlying pieces, and via simplification, via better modularization of capabilities and data models and so on, can actually automate away a lot more of the underlying pain points, and make our job as data engineers cleaner and simpler so we can focus on more of the new data product creation. So hopefully we continue to see this as an increasing trend in the years to come.
Tobias Macey
0:37:43
Alright, well, thank you both very much for taking the time today to join me and discuss your experience of working together and migrating Mayvenn's data platform onto a new infrastructure that has enabled them to move faster and gain more value out of the information that they're collecting. It's always interesting to be able to get some inside perspectives on these types of operations, and so I appreciate both of your time and efforts on that, and I hope you enjoy the rest of your day.
Sheel Choksi
0:38:06
Thanks so much for having us. Thank you.
Tobias Macey
0:38:13
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.