Summary
Building a reliable data platform is a never-ending task. Even if you have a process that works for you and your business, there can be unexpected events that require a change in your platform architecture. In this episode the head of data for Mayvenn shares their experience migrating an existing set of streaming workflows onto the Ascend platform after their previous vendor was acquired and changed their offering. This is an interesting discussion about the ongoing maintenance and decision making required to keep your business data up to date and accurate.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Sheel Choksi and Sean Knapp about Mayvenn’s experience migrating their dataflows onto the Ascend platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start off by describing what Mayvenn is and give a sense of how you are using data?
- What are the sources of data that you are working with?
- What are the biggest challenges you are facing in collecting, processing, and analyzing your data?
- Before adopting Ascend, what did your overall platform for data management look like?
- What were the pain points that you were facing which led you to seek a new solution?
- What were the selection criteria that you set forth for addressing your needs at the time?
- What were the aspects of Ascend which were most appealing?
- What are some of the edge cases that you have dealt with in the Ascend platform?
- Now that you have been using Ascend for a while, what components of your previous architecture have you been able to retire?
- Can you talk through the migration process of incorporating Ascend into your platform and any validation that you used to ensure that your data operations remained accurate and consistent?
- How has the migration to Ascend impacted your overall capacity for processing data or integrating new sources into your analytics?
- What are your future plans for how to use data across your organization?
Contact Info
- Sheel
- Sean
- @seanknapp on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Mayvenn
- Ascend
- Google Sawzall
- Clickstream
- Apache Kafka
- Alooma
- Amazon Redshift
- ELT == Extract, Load, Transform
- DBT
- Amazon Data Pipeline
- Upsolver
- Pentaho
- Stitch Data
- Fivetran
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council.
Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Sheel Choksi and Sean Knapp about Mayvenn's experience migrating their data flows onto the Ascend platform. So, Sheel, can you start by introducing yourself?
[00:01:49] Unknown:
Sure. So my name is Sheel, senior director of operations here at Mayvenn, responsible for product, engineering, and our data analytics teams. Been at Mayvenn for about 5 and a half years now.
[00:02:02] Unknown:
So pretty familiar with Mayvenn and happy to talk about our journey to Ascend. And, Sean, you've been on previously to talk about your experience building Ascend. But if you can introduce yourself again for anyone who hasn't listened to that episode.
[00:02:14] Unknown:
I have. I'm really excited to be back. I'm the founder and CEO here at Ascend. We're a roughly 4 year old company, and I'm happy to chat more about what we're doing, but frankly, even more excited about everything that Mayvenn's been doing.
[00:02:31] Unknown:
And so, going back to you, Sheel. Do you remember how you first got involved in the area of data management? Yeah. Sure. So, you know, I think as like most startups, we started,
[00:02:41] Unknown:
with the engineering team. And at that point, even early on, we realized that at Mayvenn, we wanted to collect a lot of this data that a lot of companies might deem transient. Mayvenn is a half e commerce company and half, sort of, gig economy and service matchmaking company. An example from that might be your traditional orders table, you know, with the line items in an order. Even back then we realized that, you know, rather than just knowing the current state of the line items in the order, we'd rather know which items were added and removed and when they were added and removed, to sort of get a better sense of what our customers are really up to. So even back then when we were just an engineering team, we started, sort of, managing our data from a standpoint of if we can at least hold on to everything and, you know, the way storage prices keep getting cheaper and cheaper, we'll be ready as the company keeps expanding.
As kind of the company did, in fact, keep expanding, that's when we started bringing on an actual dedicated data team. And so, it was 1 of the teams that the company asked me to build. So that's kinda how I got involved. A little bit outside of engineering, but also kind of that transition period from, you know, all of the engineering team's collecting of data to our actual data team's usage of the data.
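To make the contrast Sheel draws here concrete, below is a small Python sketch of the difference between storing only the current line items of an order and keeping the add/remove events themselves. The field names and SKUs are purely illustrative, not Mayvenn's actual schema.

```python
from datetime import datetime, timezone

# Current-state view: all you know is what is in the order right now.
order_snapshot = {
    "order_id": "o-1001",
    "line_items": [{"sku": "bundle-18in", "qty": 2}],
}

# Event-level view: every add/remove is kept, so you can see that the customer
# added a closure and later removed it before checking out.
line_item_events = [
    {"order_id": "o-1001", "event": "item_added",   "sku": "bundle-18in", "qty": 2,
     "at": datetime(2020, 1, 3, 14, 2, tzinfo=timezone.utc)},
    {"order_id": "o-1001", "event": "item_added",   "sku": "closure-4x4", "qty": 1,
     "at": datetime(2020, 1, 3, 14, 5, tzinfo=timezone.utc)},
    {"order_id": "o-1001", "event": "item_removed", "sku": "closure-4x4", "qty": 1,
     "at": datetime(2020, 1, 3, 14, 9, tzinfo=timezone.utc)},
]

def current_state(events):
    """Replay the events to reconstruct the order snapshot at any point in time."""
    state = {}
    for e in sorted(events, key=lambda e: e["at"]):
        if e["event"] == "item_added":
            state[e["sku"]] = state.get(e["sku"], 0) + e["qty"]
        elif e["event"] == "item_removed":
            state[e["sku"]] = state.get(e["sku"], 0) - e["qty"]
    return {sku: qty for sku, qty in state.items() if qty > 0}

print(current_state(line_item_events))  # {'bundle-18in': 2}
```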
[00:03:54] Unknown:
And Sean, can you share your experience of how you got started with data as well? My first experience with data was pretty quickly, out of college back in 2004,
[00:04:04] Unknown:
which feels like a really long time ago. I was a front end software engineer at Google, and we ran a lot of experiments on web search. I think, you know, similar to what Sheel had described, we collected a ton of data to really analyze how users were engaging with web search. And so, as we were running a large number of experiments on web search, and I think there was over a hundred per year that we were running users through, we were doing large MapReduce jobs to analyze the results of those experiments. So I started writing a lot of MapReduce jobs in the internal language called Sawzall back then to do metrics and track efficacy of our various experiments.
[00:04:44] Unknown:
And, Sheel, you mentioned that Mayvenn is a combination of ecommerce platform and gig economy marketplace. So I'm wondering if you can just give a bit more of a description of what the company is and some of the types of data that you're dealing with and some of the ways that you're using it. Yeah. Definitely. So, you know, Mayvenn's been around for a little while. And so the model that I described is a newer 1 that we started about a year and a half ago. And so the products that we primarily sell at Mayvenn are hair extensions.
[00:05:10] Unknown:
So, clip ins, tape ins, wigs, and your traditional sew in type bundled hair. All of these products typically require a professional stylist to help you install them after you purchase it. So what Mayvenn does is we combo those 2 sides of it together. So what you can do is you can come to the Mayvenn website and purchase the hair. That's a traditional e commerce experience. Then after you finish your purchase, we'll matchmake you with 1 of our service providers. We've got thousands of stylists spread out all across the United States who will actually do the service for you at a discounted price. You know, the first half of that is sort of e commerce and the second half of that is what we look at as sort of a gig economy, with these independently contracted stylists. And I think the second part of your question there was sort of what we're up to in terms of data collection. So, on the e commerce side, it's a lot of the standard e commerce things that you might expect. On the back end, it's orders and shipments and fulfillment and keeping track of fraud and things like this. On the front end, it's a lot of clickstream data. So, of course, page views, add to carts, email captures, everything that we need to understand the customers.
And then I think where the data gets, more interesting is marrying all of that together with the services side of what we're up to. So that includes, like stylist directories and, how we surface the best stylists, collecting more ratings and reviews for our stylists, getting that feedback, and then just a lot of operational things as well. So stylists can get sort of paid out on time and all of that. So there's sort of, 3 levels of data groupings. There's the the consumer, the stylist, and then internal payment operations.
[00:06:45] Unknown:
And in terms of the biggest challenges that you're facing in dealing with collecting that data and then processing it and gaining value from it, what were some of the most notable ones, particularly as you first started to build out that data organization within Mayvenn and trying to get your head around what it is that you had and what you were trying to do with it? Yeah. Definitely. So I'd say it's actually similar to most of Mayvenn's problems being,
[00:07:10] Unknown:
you know, kind of a startup, is most of the problems tend to be of surface area. So not necessarily super deep into dealing with, you know, petabytes or exabytes of data, but more just dealing with the wide variety of it. You know, as I mentioned, clickstream data, you know, that has its own unique properties of, you know, maybe some geo IP or user agent parsing. When we're thinking of, sort of, the back end data, it all has its own schemas. And then, you know, since, obviously, we're not building everything ourselves, you know, I think last I checked, there's probably more than 20 external vendors who we use that also have their own data.
And so figuring out, basically, okay, I've got all these different rich sources of data. How do we sort of make that into a meaningful, analytical ready kind of version of that data is really, I think, where our problems are. And before you started
[00:08:06] Unknown:
migrating onto the Ascend platform, can you give a bit of an overview of the overall infrastructure and sort of systems components and some of the vendors that you're working with to try and be able to wrangle the data and get some valuable insights from it? Yeah. Sure. So,
[00:08:20] Unknown:
internally, in our Mayvenn sort of engineering systems, everything is modeled as basically a giant event driven application. So we use Kafka pretty heavily for that in terms of events going through, and then applications can react to events. So simple examples might be, you know, an order placed event that a fulfillment provider might react to and then, you know, decide to fulfill that order. So, internally, we've always kind of had that, which sort of leads to almost an extreme convenience into just kind of dumping more and more events in. You know, whether that means a vendor who integrates through webhooks or, you know, just fetching some API report and then translating those into events. It sort of allows for bringing in more events very quickly, which is great on the ingest side. But I think on the, kind of what I was referring to of the analytical ready side, that's where I think you kind of now have to deal with this variety of data and all sorts of different schemas and all of that. And so in our original architecture, after it all kind of went through Kafka, that's when we used this vendor called Alooma, who I think got acquired by Google last year. They helped us sort of detangle all of these events, decide which ones should go to which tables. We use Redshift mainly for analytics. So mapping those to Redshift tables, whether that's a 1 to 1, 1 to many, or a many to 1, in kind of a very light transformation, I'd say. So it was definitely kind of an ELT shop, in that sense. And, you know, that was kind of it. The goal was really just, okay, we've got all this data flowing. Can we just get it mapped to Redshift so that we can then do kind of the rest of our transforms and then start building up the analytical tables that we want, both for ad hoc analysis or, like, our data scientists use,
[00:09:59] Unknown:
or for Looker reporting. That sort of thing. And were you using anything like dbt for being able to manage the transformations of the data once it landed in Redshift and being able to have some sort of versioning and source control over how you're actually manipulating it from the point of landing it in your data warehouse and whatever other data storage systems you have, and then through the different views that you're trying to get across it?
[00:10:25] Unknown:
Yeah. So we actually ended up going super lightweight there. So we started with Amazon's Data Pipeline and sort of using its, you know, it has, like, Redshift activity nodes and SQL transform nodes and that sort of thing. So we started super lightweight and just kind of manually managed the actual source files in GitHub. So I'd say it was pretty sticks and stones, kind of trying to assemble that together. You know, that was 1 of the things that we sort of noticed, was that because the initial tool we were using, Alooma, is more of a stream based tool and more set up for ELT, basically, we were kind of on our own for that last transform. And that's 1 of the pain points that we wanted to fix going forward. And so
[00:11:05] Unknown:
that brings us to the point of deciding that you were at a tipping point where you didn't necessarily want to keep duct taping together a number of different solutions to be able to try and, gain some insights from the data that you do have. And I'm wondering what the final straw was that actually led you to decide that you needed to do a major rearchitecture and what the criteria were for deciding what the components were or what the new direction needed to be for being able to
[00:11:36] Unknown:
tame all the different types of data that you were working with? Yeah. Sure. So the final straw is, well, not as intentional from the Mayvenn side as I wish I could say it was. But, basically, after Alooma got acquired by Google, they were sort of ending Redshift support. So that kinda left us without really a vendor in place. But what I will say is even outside of that final straw, the other big pain point that was happening is that because it was sort of a stream based transform, you either kinda transform it at that moment or you kinda let that data go.
You know, obviously, there's restreams and that sort of thing, but it wasn't super easy to do. And so it led us to this kind of hoarder mentality in that, transform everything, keep everything, because this is the shot. Whether or not it's actually useful, whether or not anybody even understands what this column means or what this event means, let's just hold onto it. Let's just get it in Redshift so that we have it, which sort of inevitably led us to this extremely bloated Alooma and Redshift, with tons of tables and tons of columns that didn't necessarily have a lot of business value, and added to just clutter and confusion. You know, when we were adding new people onto the team, there was just this enormous ramp up of, like, what is all of this stuff? What do I actually need to use? Should I use this table or that table? They both look similar. And so I would say that's the other major 1 that we were kind of looking to solve once we realized that we didn't have Alooma anymore. And so that kind of helped frame our perspective of what we were looking for in the search. As we started looking through vendors, we found that, at a rough level, we could kind of group them into 3 categories. 1, we called the stream based category. That was the Aloomas.
I think there's a similar 1 called Hevo Data these days. We kind of bucketed those all into the stream based vendors. Then we found kind of the more declarative vendors. That's what we kinda consider Ascend, which is sort of, you know, kind of more describe what it is that you want, and not necessarily so imperative as, like, we're gonna take this event, we're gonna split it this way and that way. And so, you know, some others that we found that we bucketed in that category might be like an Upsolver or folks like that. And then kind of the 3rd category that we saw was more of these, kind of, legacy — not legacy, I should say, but, you know, maybe some of these older generation ones, you know, like the Pentahos — and then there's very specific, like, value based ones of, maybe like Stitch Data or Fivetran, where they're very opinionated and inform how exactly this is all going to go. And, you know, we were used to the flexibility that Alooma provided us. You know, they allow you to write kind of arbitrary Python. And so we knew we didn't wanna go in the direction of maybe 1 that didn't allow us to be very flexible, you know, such as, like, a Fivetran or that sort of thing. And so that sort of left us, like, you know, we could do a 1 to 1 stream based migration or, you know, we could take a look at some of these newer providers and what they're up to. And as soon as we started looking at these kind of newer providers, Ascend obviously being our final choice there, we immediately, sort of, just saw that value add. And how I summarized that value add is it, sort of, made change cheap.
So as I mentioned with Alooma, if all of a sudden we didn't map a column and, you know, we need to go back and get that column that was important, you know, that was just really expensive. You had to wait a while. You had to go figure out how you were gonna go get the old data again. You had to wait for it to slowly, 1 event at a time, kind of restream in. You had to deal with the deduplication problem at the end. You know, it was just sort of a lot of piecemeal steps that made change expensive. Right? But as we started looking at these newer providers, that's what we saw as the value add. Is we didn't need to hoard all the data exactly as we needed it in Redshift. But rather, we could treat Redshift more as it was meant to be. It was, let's prep these tables for how we want them. Let's bring in the columns that have meaning to us, and let's not fear that maybe we made a mistake, or more importantly,
[00:15:24] Unknown:
you know, data will change, schemas will change. Let's be able to adapt. So, Sean, from your perspective, I'm curious what your experience was and at what stage you were when Sheel first came to you and wanted to trial the Ascend platform with the workload that he had at Mayvenn, and any aspects of the types of data that he was working with or the variety of data that posed any sort of challenge for the state of the Ascend platform at the time, and how that may have helped you in terms of determining your product direction?
[00:15:55] Unknown:
Yeah. It's a really good question. And, you know, I think a couple of the things that Sheel mentioned really resonated really quickly with us when we saw a lot of their use case. Which, well, I think first was when they were notified that they had to find another solution and were being forced into this pattern and this reevaluation very quickly. We, of course, love, because it highlights the need and the requirement that data teams have to move quickly. And so that was actually, I think, and, Sheel, you should correct me if I'm wrong, but I think it was something like a week or 2 from when you had first read about us all the way to where you'd gone into the free trial, built some stuff, contacted the sales team, and had essentially signed a contract with us, which is super cool, but highlighted how fast you have to be able to move in sort of the modern era.
And I think the other thing about their use case that really resonated a lot was around this need for flexibility. Right? You have all these really interesting data sets coming in. The variety is incredibly high, and Sheel hit, I think, the nail on the head, which was, like, you don't know what you're going to need yet. Right? So you start to sort of get this hoarder mindset of I'm gonna go and keep everything, but you're trying to find that balance of you don't wanna go jam it all into your data warehouse, because that's an anti pattern and it just abuses your data warehouse and clogs it up. But you need that flexibility to go create new derived insights and derived datasets on top of other datasets down the road. And so, when we started to see a lot of those patterns emerge, 1, it's 1 of the things that we had seen, many of the people at Ascend had seen, earlier on in our careers, which really drove us to create Ascend. But, 2, we think it is a pattern that a lot of teams encounter. Now, as far as the other part of your question you asked, as far as product extensions or new features that we had built, I'm not sure. I'm sure there's something that we may have needed to add on top of.
Sheel, you could probably even comment better than I could.
[00:18:03] Unknown:
Sure. There were a couple that maybe it'd help jog your memory. 1, I think, was figuring out how we were gonna do exactly this geo IP parsing kind of in a declarative, batch based world. It's a lot easier, obviously, to think about in a stream, but a little trickier to think about, and be efficient in, a batch. So that was 1, and I think that sort of extended itself into sort of joining partitions.
[00:18:25] Unknown:
So if you wanna talk about any of that. Oh, yeah. Thank you so much. That was 1 of the really cool exciting ones that we did add. So part of this is, in a declarative model, you're really defining the end state. Right? You're not saying, hey, run this block of code every day at this time. And it's really interesting when you combine something like that with IP parsing, because the assignment of an IP address to a particular geography is a time specific thing. And so 1 of the features that we added was this ability to do really optimized joins of data that are time specific. And so, in a declarative model, if you change code or you change a definition, the system itself will start to recalculate those definitions based off of the new model. But 1 of the things that you don't want to do is recalculate stuff that is based off of a snapshot in time from yesterday or a week before, even a month or a year before. And so 1 of the things that our team honestly had a lot of fun going and implementing was this deeper level of time awareness, so you could do these really advanced
[00:19:29] Unknown:
joins across different data sets and different partitions of data in those data sets that are more time aware and sensitive to that. Yeah. That's definitely an interesting challenge of being able to reprocess your data in a batch manner for data that, as you said, is subject to change at any particular point in time, and geo IP being an interesting case of that, where a lot of people might think about it as being somewhat static, but particularly for IPv4, it's something that gets reassigned depending on who's bidding on those IP blocks. And so what today might point to Wisconsin, tomorrow might point to New York, because that's just how maybe Amazon allocates them, for instance. So that's definitely an interesting approach and something that I hadn't really thought of before. Yeah. It's definitely 1 of the ones where, in the imperative model, that sort of use pattern works well.
[00:20:16] Unknown:
And, so, that was the fun part in the declarative model, is creating this awareness. Right? In a declarative system, you always have the underlying control plane, which is the thing that's dynamically generating activity and work for you. And so you're constantly adding more intelligence into that control plane, similar to how you would a query optimizer or a query planner for a database. And so putting in these continued optimizations into that engine, so it can get smarter and smarter based off of the declarative model, the blueprint that you create, is an area of really interesting continued research and development, frankly.
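To picture the kind of time-aware join Sean is describing, here is a minimal PySpark sketch that joins clickstream events to whichever geo IP mapping was valid at the time of each event. The table names, columns, and validity-interval convention are assumptions for illustration, not Ascend's actual implementation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-aware-geoip-join").getOrCreate()

# Hypothetical inputs: clickstream events carrying an IP and a timestamp, and a
# geo IP table where each mapping records the interval during which it was valid.
events = spark.table("clickstream_events")   # event_id, ip, event_ts
geoip = spark.table("geoip_mappings")        # ip, region, valid_from, valid_to

# Join each event to the mapping that was in effect when the event happened, so
# recomputing an old partition never picks up today's IP-to-region assignments.
# (Real geo IP data keys on IP ranges rather than single IPs; equality keeps the
# sketch simple.)
enriched = events.join(
    geoip,
    (events.ip == geoip.ip)
    & (events.event_ts >= geoip.valid_from)
    & (events.event_ts < geoip.valid_to),
    how="left",
).select(events["*"], geoip.region)

enriched.show()
```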
[00:20:55] Unknown:
And, Sheel, I'm curious if there were any other edge cases that you ran into as you were migrating onto Ascend that were easy in that stream based approach but became either difficult or impractical, or you just needed to think about a different way of approaching it in this declarative model? Yeah. So I think the other 1 is sort of taking the latest version of something. So I think if you have truly just an event stream, whether it's imperative or declarative, that's sort of not an issue because, you know, maybe they're all append only. But when we start getting into actual resources, maybe users, and just kind of wanting the latest data of the user, that's definitely a little bit of a different thing to think about in imperative versus declarative. In imperative, it was as simple as, you know, just kinda keep appending the rows, and then as we deduplicate data, just take the latest 1. Whereas in a declarative approach, it's much more of how you would think about it in SQL, really, which is, you know, let's try to row number them across and get the latest version by a created_at or something like that. It feels a little bit different. And it feels maybe a little bit more complex. But I think, under the hood, it's kind of the same idea regardless. It's just sort of like, are you doing patch set updates or are you kind of redoing a bunch? And I think that is 1 of the things that, at least on my team here, took a little bit of our time, sort of, wrapping our heads around, is that Ascend finds this happy medium in between redoing everything to find the latest and, you know, just that patch set between that very last event and the previous 1, in sort of how it treats partitions.
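As a rough sketch of the "row number and take the latest" pattern Sheel mentions, the PySpark below keeps one row per user by ranking on a created_at timestamp; the table and column names are assumed for illustration, not Mayvenn's actual schema.

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("latest-record-per-user").getOrCreate()

# Hypothetical append-only feed of user snapshots.
user_updates = spark.table("user_update_events")   # user_id, email, created_at

# Rank each user's rows newest-first and keep only the top row: the declarative
# equivalent of "dedupe the stream and take the latest record".
latest_per_user = (
    user_updates
    .withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("user_id").orderBy(F.col("created_at").desc())
        ),
    )
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```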
[00:22:22] Unknown:
It's really quite elegant the way it's done, but it took our team just sort of a little bit of a mental swap in switching approaches just to understand how that works. And, obviously, you ended up having to retire the dependency on Alooma because of the fact that that's what pushed you onto Ascend. But now that you've been using the Ascend platform for a little while, I'm curious if there are any other aspects of your data infrastructure and overall architecture for your data platform
[00:22:50] Unknown:
that you have been able to rethink or reimplement or even completely retire as a result? Yeah. Quite a few. So, you know, as we kinda talked about, we had kinda patched a bunch of different things together. So, you know, outside of the event stream where most of our data comes from, you know, we have a decent number of, basically, API fetchers, if you will, that just kinda need to go to, like, the Facebook API and grab our latest ads data or, you know, go to our reviews platform and grab the latest reviews, that sort of thing. And so we were sort of managing that ad hoc in Data Pipeline. So, you know, kind of some custom coded scripts that run in Data Pipeline that fetch that data in. That's 1 that we've been able to retire, and thankfully so. Ascend has, sort of, custom read connectors. And so you can, sort of, write Python scripts that are both run and managed by Ascend. And so we were able to convert those all into that. So that's gone.
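For a sense of what those API fetchers look like, here is a minimal, generic Python sketch of the pattern: page through a vendor's reviews endpoint and yield records for the pipeline to ingest. The endpoint, parameters, and function shape are hypothetical, and this is not Ascend's connector interface.

```python
import requests

REVIEWS_URL = "https://api.example-reviews.com/v1/reviews"  # hypothetical endpoint

def fetch_reviews(api_key: str, since: str):
    """Yield review records updated after `since` (ISO date), one page at a time."""
    page = 1
    while True:
        resp = requests.get(
            REVIEWS_URL,
            params={"updated_since": since, "page": page, "per_page": 100},
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        resp.raise_for_status()
        records = resp.json().get("reviews", [])
        if not records:
            break
        yield from records
        page += 1

# A scheduled job (previously AWS Data Pipeline, now a managed connector) would
# call fetch_reviews(...) and hand the records to the rest of the pipeline.
```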
Then on the actual Alooma side, we covered that. And then from when the data goes into Redshift, you know, as we talked about, we had a bunch of additional Data Pipeline jobs that, you know, might retransform that data using Redshift. And so, we've been able to start extracting those into our actual ETL pipeline. So, you know, instead of it being more of a pure ELT, it's a little bit more of an actual ETL now. And so, we don't necessarily need to run these expensive scripts once a day in Redshift, which has led us to 2 things. 1 is that, well, Redshift doesn't randomly spike once a day. And then, the second is that the frequency that the data is available to both our data team as well as everybody else within Mayvenn has actually gone up, you know, because we can do this on sort of a higher frequency. And so, people are getting fresher data. People are reacting a little bit faster to, you know, maybe an ad that's not performing as well as it used to, and that sort of thing. So we've been able to take out all of those components with this migration. And I'm wondering what the overall process was for making that migration
[00:24:41] Unknown:
and what the timeline was that you had available for being able to cut things over. And then finally, as part of the overall migration, I'm wondering what steps you took to perform any sort of validation or testing to ensure that the results that you were getting before and after were still consistent, and that you were able to ensure that you either weren't missing data or that the transforms weren't hitting some sort of bug that was causing a variance in the overall results of the analysis that you're performing? Yeah. Sure. So we dedicated,
[00:25:13] Unknown:
you know, quite a bit of resources in order to make sure that this migration was successful. So, you know, our data team is 4 people, and we pretty much used about 2 thirds of everyone's time, full time, for about 4 to 6 weeks, I would say, to cut over. And so, you know, I think that includes not only just sort of the 1 to 1 mapping of Alooma, but these additional transforms that I was talking about that were being performed elsewhere, sort of moved into Ascend as well. So that's kind of what the project timeline and bandwidth looked like. As we keep going, of course, there's just additional resources required as new events come in and that sort of thing. But it's going a lot quicker than Alooma was. And then, I think the second part of your question there was sort of, you know, in the migration, how do we know that, you know, we're ready to go live and that the data is accurate? I think in any migration, we tried to follow some best practices, where we tried to keep most of the base tables in a 1 to 1 style so that we could more easily audit. Instead of trying to bite off more than we could chew in 1 transform, given such a powerful tool, we still kept it fairly basic. We didn't really get too fancy with doing too many joins or row operations or anything like that that we weren't doing in Alooma, just because we could do that. So that helped with the QA and the validation quite a bit, since it was much more of an exercise in, you know, I have this many rows over here, I should have this many over here. You know, these columns kind of look like this, they should look like this. It was a lot of that basic profiling of our data, making sure that that profile still matched. You know, important things like making sure that the revenue for a particular month was still the same revenue, that kind of stuff. And so that helped us quite a bit. And then it got into the nitty gritty of what you mentioned of, like, you know, maybe particular transformations and subtle sort of logic errors. And, you know, I think 1 of the things I like about Ascend and Spark, which underlies it, is that, you know, in general, if there is an error, it tries to push it down as far as possible and keep things sort of as a null. That makes it a lot easier to audit and verify, because then, instead of dealing with, like, mysterious errors in your ETL pipeline of, like, you know, line 46 cannot call this function of nil, and trying to figure out which row that might possibly relate to in your massive dataset of 10,000,000,000 rows, you actually have it all in your database already. It's not perfect, but it's somewhere that it's queryable. And so that was really nice in this as well, is that at the end of the day, what data folks are most familiar with is queryable data. And so even if the ETL isn't perfect, at least the fact that it made it all the way through and is queryable very quickly let us diagnose any issues, figure out which rows were having that problem, you know, kind of just using queries, and then fix that up. Very easily just change that script in Ascend, let Ascend rerun it, do its sort of auto magic, if you will, and then, sort of, just check it again in our Redshift. Which, sort of, gave us a value add in a different feature that Ascend has as well, which is, in each step of the transformation, Ascend lets you write SQL queries to kind of query your data.
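The kind of validation Sheel describes can be as simple as comparing row counts and monthly revenue between the old and new tables. The sketch below does that in PySpark against hypothetical table and column names (orders_legacy, orders_ascend, ordered_at, total_amount).

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("migration-validation").getOrCreate()

old_orders = spark.table("orders_legacy")   # hypothetical pre-migration table
new_orders = spark.table("orders_ascend")   # hypothetical post-migration table

# 1. Row counts should match, or differ only by a known, explainable amount.
print("row counts:", old_orders.count(), new_orders.count())

# 2. Monthly revenue should line up between the two pipelines.
def monthly_revenue(df):
    return (
        df.groupBy(F.date_format("ordered_at", "yyyy-MM").alias("month"))
          .agg(F.sum("total_amount").alias("revenue"))
    )

mismatches = (
    monthly_revenue(old_orders).alias("old")
    .join(monthly_revenue(new_orders).alias("new"), "month", "full_outer")
    .where(~F.col("old.revenue").eqNullSafe(F.col("new.revenue")))
)
mismatches.show()   # any rows here are months that need a closer look
```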
That turned out to be hugely valuable in this validation process as well, because not only could we then query Redshift to see what went wrong, but we could then kind of rerun, like, a modified version of that query at each transform node and then really track down where it went wrong, you know, especially in a more complex transformation with several steps. And now that you have been using Ascend in production for a while, you've been able to replace some of your preexisting infrastructure and optimize some of the overall query patterns and ETL jobs that you're building. I'm wondering how you have been able to use the additional capacity that you have, both in terms of compute infrastructure and scalability, but also in terms of human capacity, for being able to either onboard new data sources to gain new insights from those or take new looks at the data that you do have or integrating them into overall analytics? Yeah. Definitely. So I would say time for change, like, either the amount of time it took to get a new data source in or update an existing 1, has sort of gone down from what was 1 or 2 days' worth of work down to, honestly, less than an hour, where someone can very quickly, like, kind of propose a change. Somebody else can, you know, take a look at it. And by the time it goes live and Ascend scales up and replays everything, in an hour or 2, that data is already mapped to the database. So that has been markedly different. I would say we used to spend 1 fourth of the team's time just trying to keep up with all the new data sources and changing schemas of given events and that sort of thing. And that's decreased to, oh gosh, I would say pretty much negligible at this point, because, you know, again, in Alooma, it was sort of a hoarder's version. Right? And so you had to keep up with every single schema change, every single event change. Whereas now, it's sort of more pull based. So, if you want new columns, you can go explore, you can go find them, and you can go bring them in very quickly. But if you're pretty happy with the way things are, you don't have to keep up with every single schema change. So that has added up to a huge amount of human savings within Mayvenn, which then just translates to, you know, more of the actual projects that we need to be doing at Mayvenn. You know, Mayvenn is obviously not a data infrastructure company, and so that's just not where we wanna be spending our time. We wanna be spending our time, you know, finding the best stylists and matching them with our customers.
So, you know, that's what that all translates to. And to your point about the time from identifying a data source to getting it active in the platform, I'm curious what the difference in the collaboration style or the overall workflow looks like between where you were to where you are now. So, I mean, I think in that sense, it's similar. You know, both of them are code based. So either, in Alooma, you're writing Python code and, in Ascend, you're sort of either writing PySpark or SQL code. And so, given that it's code, obviously, we've always had somebody to review any of those changes. So, that part remains the same. What I will say though is our data team is made up of analytics managers, data scientists, people just starting to learn that.
Different people are in different places. Ascend has allowed us to democratize more of that process. So, when it was all written in Python in Alooma, the number of people who could actually review a change that was going into the stream was quite few. It was somebody who had to understand both Python and unit testing in Python, you know, kind of both of those, in order to be able to look at it, and then also have a good understanding of the schema and everything that might be related to it. In the Ascend version of this though, because it's SQL or PySpark, and, you know, a lot of PySpark looks and feels like SQL abstractions, basically our whole team is now able to participate in this process that used to be just a couple of folks. And so that has actually been a better collaboration style. And are there any other aspects of your experience of migrating your overall data platform, either anything specifically about Ascend or the process of identifying new platforms or rearchitecting
[00:31:58] Unknown:
your systems, or from your side, Sean, anything that we didn't discuss from your end of bringing Mayvenn onboard and working with them to identify the optimal way to take advantage of your platform,
[00:32:11] Unknown:
any of that that we didn't discuss yet that either of you would like to cover before we close out the show? Well, so, you know, 1 of the things that I would certainly double down on that Sheel has highlighted is, you know, I think the new metric for really progressive data engineering teams is no longer how much data can you ingest or how fast can you process it or how much of it can you store. I think it's really changing now to how fast can you move. How many new insights? How many new data products can you produce? And, frankly, a lot of that comes down to how can you really minimize and optimize the maintenance burden on teams so that you can really focus the lion's share of your efforts on creating new value. And I think that's what's really exciting in working with Sheel and the team, is they're laser focused on that, and I think totally embody
[00:33:00] Unknown:
that new focus from a leadership perspective of how do we just enable teams to go faster. Yeah. The only thing I would add is sort of a tangent from that, but something that our engineering team has actually brought up since they've been working with Ascend a little bit more, is that, you know, I think there's a joke in engineering that so many problems are really just taking some data, transforming it, and pushing it somewhere else. Whether that's through an API or some batch transfer, you know, some SFTP type situation. And so, when you really take like a bird's eye view of Ascend, there's a lot of problems that would have to be solved with, kind of, custom code that can be done really well in a platform as flexible as this. And so, we haven't gotten there yet, but the engineering team has a lot of interest in using the Ascend platform in use cases outside of maybe this ETL.
So an example is, maybe with 1 of our vendors, we need to upload, you know, maybe some basic order information over the past month in order for them to do some reconciliation. Of course, you could write an app to do that, but then you take a step back and you're like, well, this ETL pipeline is already doing all of that work. Couldn't we just do another transform? Couldn't we just do another write output that could handle all of this? And so, you know, I don't know if this is necessarily something that Ascend finds interesting, but it's something that a lot of people within Mayvenn are talking about as a way to structure things a little differently. Yeah. That's definitely interesting, seeing some of the ways that the lines are blurring between data infrastructure
[00:34:23] Unknown:
and data engineering and data science and application infrastructure and architecture, and how the tools are becoming flexible enough that they're actually starting to bridge the divide where they can be used in all of these different contexts, so that you can leverage the underlying platform without having to have these bright lines that separate what is being used for which aspects of the business. Exactly. Alright. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, starting with you, Sheel, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:35:00] Unknown:
Yeah. That's interesting. I think we're still in a place of so many tools. I think what's happening in the data industry is that, you know, it kind of reminds me of JavaScript and front end development. There's a lot of problems that are very quickly being solved, which sort of leads you to, okay, first we'll grab the data from here, then maybe it goes through Kafka, then maybe it goes through this ETL vendor, then maybe Redshift. Oh no, our Redshift is getting full. Let's offload that to S3. Oh wait, but we want to query S3. Let's bring that back into Athena or Redshift Spectrum and, you know, tie that together. Oh, but we wanna extract that and then run data science on it, so let's run EMR jobs on it. And, you know, I think where we're at is it's just a massive proliferation of all of these tools. And it's a lot to keep up with in the data space in terms of maybe what somebody's up to. And so I think these higher level orchestration tools and higher level coordinations that, you know, start abstracting away these details of necessarily exactly how things are done and which exact technology choice we're using there, I think it's gonna become valuable to allow people to be more productive and not necessarily have to be experts at every single tool. And Sean, do you have anything to add as far as your perspective on the biggest gap that you see in tooling or technology that's available for data management today? Yeah. I'd say I'd certainly build,
[00:36:17] Unknown:
reinforce and build on top of what Sheel is saying, which is, you know, I think we have a lot of really interesting technologies that are designed for point solutions, and they make it really easy to prove out a concept or build your first pipeline or write your first query. But the tools, I think, are really geared today for how do you do a great hello world and sort of first use case. And I think there's been less time spent on how do we make sure that these systems are incredibly low maintenance burden. So much so that I think it's a bit of a badge of honor to say you're a data engineer, because everybody is like, oh, yeah, like, you have to go deal with all these things that break all the time, and it creates a certain bond among all of us.
But I do believe that, similar to what we've seen in other domains and technology spaces, like infrastructure, for example, there is the ability to build these higher abstraction layers that as a result have a better understanding of all of the underlying pieces. And via simplification, via better modularization of capabilities and data models and so on, they can actually automate away a lot more of the underlying pain points, which makes our job as data engineers cleaner and simpler, so we can focus on more of the new data product creation, if you will. So, hopefully, we continue to see this
[00:37:41] Unknown:
as an increasing trend in the years to come. Alright. Well, thank you both very much for taking the time today to join me and discuss your experience of working together and migrating Mayvenn's data platform onto a new infrastructure that has enabled them to move faster and gain more value from the information that they're collecting. It's always interesting to be able to get some inside perspectives on these types of operations. And so I appreciate both of your time and efforts on that, and I hope you enjoy the rest of your day. Thanks so much for having us. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
Guest Introductions: Sheel Choksi and Sean Knapp
Mayvenn's Data Management Journey
Challenges in Data Collection and Processing
Initial Data Infrastructure at Mayvenn
Decision to Migrate to Ascend
Ascend's Value Proposition
Edge Cases and Challenges in Migration
Migration Process and Validation
Post-Migration Benefits and Workflow Changes
Future of Data Management Tools
Closing Remarks