Summary
Data engineering is a relatively young and rapidly expanding field, with practitioners having a wide array of experiences as they navigate their careers. Ashish Mrig currently leads the data analytics platform for Wayfair, as well as running a local data engineering meetup. In this episode he shares his career journey, the challenges related to management of data professionals, and the platform design that he and his team have built to power analytics at a large company. He also provides some excellent insights into the factors that play into the build vs. buy decision at different organizational sizes.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all data users can apply software engineering best practices – git, tests, and continuous deployment – with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in Ab Initio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of their data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
- Your host is Tobias Macey and today I’m interviewing Ashish Mrig about his path as a data engineer
Interview
- Introduction
- How did you get involved in the area of data management?
- You currently lead a data engineering team at a relatively large company. What are the topics that account for the majority of your time and energy?
- What are some of the most valuable lessons that you’ve learned about managing and motivating teams of data professionals?
- What has been your most consistent challenge across the different generations of the data ecosystem?
- How is your current data platform architected?
- Given the current state of the technology and services landscape, how would you approach the design and implementation of a greenfield rebuild of your platform?
- What are some of the pitfalls that you have seen data teams encounter most frequently?
- You are running a data engineering meetup for your local community in the Boston area. What have been some of the recurring themes that are discussed in those events?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user friendly interface, and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey, and today I'm interviewing Ashish Mrig about his path as a data engineer. So, Ashish, can you start by introducing yourself? Yeah. Hi, Tobias. Thanks for having me on the show. So hello, everyone. My name is Ashish, and I'm a data engineering manager at Wayfair.
[00:01:49] Unknown:
So I started my data engineering journey about, I would say, 20 years ago, when they used to call it database development and when Oracle 7 was probably the cool thing on the block. So I kind of accidentally got into this and started my career as a senior developer and architect and got into big data. And for the last 8 plus years, I've been managing data engineering teams. Mostly, we're doing cloud first deployments, working with AWS and GCP. And I would say I work with pretty much all different types of databases, so RDBMS, MPP, distributed, or Hadoop or NoSQL. Obviously, there are probably hundreds of databases, but category-wise, I've worked with most of them. And I feel like I have a good grasp on how things are aligned in the data engineering space right now.
[00:02:46] Unknown:
You mentioned that you've been working in data management for quite some time now. I'm wondering if you remember how you first got involved in the space and what it is about sort of the data ecosystem and the problems that are involved that keep you interested and motivated to stay working in this area?
[00:03:05] Unknown:
So I think it's more a sort of organic evolution. As I got into seniority, I found myself helping other team members, and it organically grew into being a team leader and then a data manager. So I don't know whether this was, like, a pre-designed or predestined thing on my part. My first big break was in a company called TiVo. Those of you who are younger may not know that TiVo used to be the cool digital video recording box in the nineties. They were still around about 10 years ago when I joined them. We were in the business of monetizing TV viewership data. So anything you do on your television set top box, we would get that data, and then we would aggregate it and make it into products that we were selling to advertisers. So I think that was my first big break, where I was managing the big data team. The scale of the data was huge, and we were dealing with terabytes.
And that was my first sort of AWS cloud deployment, you know, working with Spark and Hadoop and all the distributed technologies. So I think that's how I got into managing data teams. And I found this was a very rewarding experience. I could bring a lot of experience, a lot of my sort of in-the-trenches knowledge to the table, and help teams and companies build a scalable and reliable architecture, which is probably harder to do in data than in other spaces.
[00:04:36] Unknown:
So as you mentioned, you're now leading a data engineering team at Wayfair, which is a fairly sizable company. And I'm sure you have a pretty significant volume of data. And, you know, you're probably hitting the maximum on all 3 of the v's of variety, velocity, and volume. And I'm wondering what are some of the kind of main topics that take up a lot of your time and attention in terms of being able to keep your team on track and be able to build data products and build systems that are able to manage the sort of scale of information that you're working with? So
[00:05:15] Unknown:
as a manager of the engineering team at Wayfair, one of my jobs is to look ahead. Meaning, I'm looking 2, 3, 4 quarters ahead of the team. My team is working on the current road map and delivering the day to day things that people need. But I'm looking ahead and trying to not only figure out what problems we'll have, but also solve those problems in advance. And that's obviously not an easy job to do. So the thing that I typically tell people is to think of data engineering as a big lumbering ship. You can change the direction, but it takes a long time and a lot of effort to change your direction. So once the ship is set on a course, we're moving on that course, and it takes a lot of time and effort to chart a new direction. So we have to be careful and plan our road map before we start working on it. This is maybe different than application development, where people are doing things fast and failing. Right? I can't start working on, let's say, Cassandra and do things incorrectly and then fail and then come back and do things differently. Because I need to make sure I provision the right number of nodes, the right compression, the right compaction, the right sort of algorithms, and have the right data models in place before I can even write one piece of code. So we're front loading the project or the initiative with a lot of thinking about the design: how the data should be organized and what are the tools and processes that will go into moving the data. Because once we build that process, changing it or managing it is a big task. So that's where the ship analogy comes in.
So my big thing is to make sure that the stage is set for the team to come in the next quarter or the following quarter, do the things they want to do or are required to do, and not be bogged down by some of the details. What is the word they use? I'm kind of the forward team, the advance team, that is doing the recce and setting the stage for the team to come in and execute.
[00:07:28] Unknown:
So in continuing with the ship analogy, as you're trying to kind of captain the ship and chart a course, what are some of the kind of icebergs that have tended to pop up in your way that you need to work around and steer ahead of so that your team doesn't run into a sort of catastrophic situation?
[00:07:50] Unknown:
As data engineers, we are almost always straddling two different worlds. The first world is that we are engaging the stakeholders and giving them what they need. Right? At the end of the day, we are solving a business problem. We exist because there is a business that is being run, and that business needs data from us. So that is our primary goal and need. And the second leg that we are standing on is the technology work, and I don't need to tell you, especially in the big data land, the amount of technologies and services that have proliferated is humongous. So making sure we are setting the course for the right technology, design, and architecture is critical. Otherwise, if we get bogged down in the second one, we're not delivering on the first one. Right? So those go hand in glove. One example that I'll give you, and the iceberg that I always try to avoid, is the so-called data migration.
I think almost all companies did this. In the nineties and early 2000s, we were all on prem. Starting around 2010, we all said we will go to AWS or Azure or the cloud. So we have spent enormous time going to the cloud. And now we are in the cloud, but we are also kind of stuck with one application, one vendor. In our case, we are probably not stuck, but we are using BigQuery. But tomorrow, let's say, a new technology comes around which is 15 times better than BigQuery; then we're looking at one more data migration. And data migrations are time consuming things, and they're hard. And they're even harder to undo in the cloud, especially if you're moving vendors.
I spent 6 months doing a data migration from on prem to Google Cloud, where we did not deliver a lot of value for the stakeholders. So I don't wanna do that again, and that's not good for the business. So we need to think about future proofing our architecture in a way that protects us from any new technology, or lets us embrace new technology without paying all that cost again. That is one iceberg that I always try to look out for. But the flip side is that it becomes a political issue, in the sense of: why are you trying to solve a problem that doesn't exist yet? BigQuery works fine, and there's no new kid on the block. So who knows? We will do BigQuery for the foreseeable future. So that is, I think, the minefield and the iceberg that we, as managers or leaders, have to sort of straddle and work through, making sure that we are not hitting them.
[00:10:31] Unknown:
The main thing too with the cloud adoption is that, well, it does give easy scalability of, oh, I can, you know, pay for what I need right now and just scale up as I need to. But when you do actually start to hit that scale up point, then the costs somehow seem to manage to scale superlinearly. So you need to figure out, okay, how much data am I actually going to need to use down the road? What are my query patterns going to be? How am I going to mitigate some of these, you know, unforeseen expenses that come about because of the fact that the cloud is so dynamic?
[00:11:03] Unknown:
I think you hit the nail on the head. So back in the days when we were an on prem Oracle shop, we used to spend so much time thinking about the design and architecture and tuning our queries, because we have one box, one server, which is fixed in memory. We can't scale it. It takes 2 months to get a new one. And it takes almost an act of God to ask for patching and provisioning and all those things. But now, everything is on demand. So you're right. I've seen, especially in the new sort of, I don't wanna make it a generational war, but the new generation of, you know, engineers, that the first line of defense is, oh, let's throw more hard disk or let's throw more memory at the problem. Instead of tuning your queries or solving the problem with the design, people are solving it by throwing more resources, more compute at the problem. And that invariably ends in only one way, and that way is with an email from your head of infrastructure, or whoever is managing resources, that your spend is through the roof and you need to justify it.
Until somebody puts a guardrail in, I think it's a natural human tendency that we take the path of least resistance. And that is another iceberg in the technology, where we, as managers, are always asking people and coaching people and guiding engineers to make sure that we solve problems in the order of code, design, and architecture, with hardware as the last resort, and not the other way around. But, again, that also becomes a political hot potato, where people need things fast. And doing it the right way is time consuming. So it's a matter of picking those battles and explaining them.
But fortunately, Wayfair is a rich company, so we can afford all that hardware that Google is giving us.
[00:13:04] Unknown:
From the team management perspective, because you're working with all these data professionals, you have these, you know, high impact projects that can make or break the business. What are some of the useful lessons and strategies that you've gathered in your time working at Wayfair and other companies to help keep your engineers motivated and on target? And because of the, you know, continually changing nature of the ecosystem, how do you help them understand what are the valuable ways to spend their time, like, what are the lessons that are worth learning with these new technologies? And how do you identify the cases where it's actually just a flash in the pan and it's not actually bringing anything new, it's just, you know, putting a new coat of paint on something that's been around for decades?
[00:13:50] Unknown:
Yeah. That's a really good question. And I think it's a problem that every engineering manager, whether in data or any other type of software engineering discipline, has been experiencing. So engineers, by nature, are a very opinionated bunch and very high maintenance. I think one of the key things is we need to make sure that we are worrying about their development, their career development, and making sure that they are thinking in the right direction. I feel like that gives them a lot of job satisfaction. So what I tell people is don't run after the next kid on the block, but have your basics.
Solidify your basics. What I mean by that is understand the core concepts of data warehousing, data modeling, and big data. They are still relevant whether you're using Redshift or BigQuery or Synapse or whatever else. So having that core knowledge of arranging data, and understanding what the pitfalls are and what makes or breaks a data project, I think that is worth its weight in gold. But at the same time, think about their career development, giving them a chance or giving them room to work on the latest set of technologies, and making sure they're marrying that chance with solving a business problem.
Again, that is worth its weight in gold. One example that I'll give you is, as a part of career development for one of my employees, I asked them to work on building a data quality framework. As you can imagine, the data pipelines, which are managing data, have, potentially, a lot of data quality problems. And we don't want to find out retroactively, after the event has passed, from the client. So we do have this data quality framework designed using, I would say, a very innovative metadata driven data model, using Python, where anybody can go in and define their data quality checks. We then log those checks into some sort of tables and point dashboards on top of that for easy consumption. So that whole project is a very ambitious thing, but it gives people a chance to flex the data modeling muscle, work on Python packages, work with something called InfluxDB for event logging, and build the dashboards. But at the same time, we are also solving a business problem, where the customers are now able to see why something went wrong, or what was the extent of it going wrong, because we do the data quality measurements proactively on the data pipeline. As the data pipeline gets published, we also measure those metrics.
And in some cases, we are even stopping the data pipeline, because we define thresholds such that if the data crosses a threshold, then let's not even publish that data. Right? So that, I think, is a very satisfying sort of project that people like to do, especially engineers, because it not only solves a business problem, but also expands their technical horizon and keeps them current and keeps them, sort of, happy with the work.
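To make the framework concrete, here is a minimal sketch of what such a metadata-driven quality check runner can look like: checks are rows of metadata, one generic runner evaluates them, logs every metric, and blocks publishing when a blocking check crosses its threshold. The table names, fields, and use of DuckDB (standing in for the warehouse and the InfluxDB event log) are illustrative assumptions, not the actual Wayfair implementation.

```python
# Hypothetical sketch of a metadata-driven data quality framework: checks
# are metadata, a generic runner evaluates them, logs every metric, and
# blocks publishing when a blocking threshold is crossed.
from dataclasses import dataclass

import duckdb  # stand-in for whatever warehouse the checks run against


@dataclass
class DQCheck:
    name: str         # human-readable check name
    sql: str          # query returning a single numeric metric
    threshold: float  # maximum tolerated value for the metric
    blocking: bool    # if True, a failure stops the pipeline from publishing


def run_checks(conn, checks):
    """Evaluate every check, log each metric, and say whether to publish."""
    publish = True
    for check in checks:
        value = conn.execute(check.sql).fetchone()[0]
        passed = value <= check.threshold
        # In the framework described above this would be a write to
        # InfluxDB; a plain table works for the sketch.
        conn.execute(
            "INSERT INTO dq_log VALUES (?, ?, ?)",
            [check.name, float(value), passed],
        )
        if not passed and check.blocking:
            publish = False  # threshold crossed: don't publish this run
    return publish


conn = duckdb.connect()
conn.execute("CREATE TABLE dq_log (check_name TEXT, metric DOUBLE, passed BOOLEAN)")
conn.execute("CREATE TABLE products AS SELECT 1 AS id, CAST(NULL AS DOUBLE) AS price")

# Anyone can register a new check as metadata; no new pipeline code needed.
checks = [
    DQCheck("null_price_rows",
            "SELECT COUNT(*) FROM products WHERE price IS NULL",
            threshold=0, blocking=True),
]
print("publish?", run_checks(conn, checks))  # -> publish? False
```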
[00:17:07] Unknown:
As you're talking about the kind of data quality work and the challenges there and building something in house, it also brings up the question of, as an engineering manager and as somebody who does have a lot of background in the data ecosystem, I'm curious what your thoughts are on the modern data stack, as it were, and what are the actual useful pieces to pull out of that? What are the pieces that are not worth spending the time on, as a relatively mature organization, trying to kind of rearchitect around? Because you run the risk of, you know, dying a death of a thousand cuts from all of the different SaaS bills and, you know, some of the sort of cost management aspects that go into buying into all these different managed systems, and, like, the kind of build versus buy equation?
[00:17:59] Unknown:
Yeah. I think there are two dimensions to this question. One is build versus buy in general for data engineering. Secondly, for a big company like Wayfair. So let me address the second one first. I think Wayfair, similar to Amazon, has, for the most part, adopted the model that if you can build it, then why buy it? Right? I think Amazon has taken it to the extreme because they can. They're big. They don't use any commercial product because it doesn't scale for them. Obviously, we are not as big. We buy things that are industry leaders, like Jira or Git or, obviously, Google Cloud Platform services, the BigQuerys of the world. Wherever there is an industry leader, I think we are not reinventing the wheel. But to your point, in terms of the current data stack, I think it lends well to solving the big data problem.
But it doesn't lend well to solving the process problem. What I mean is most of the products that are being marketed right now are marketed as a way that you don't have to write a single piece of code, like Snowflake or these new cubes or new Apache sort of products. But we do want to code, because we are an engineering organization. We are not afraid of any code. What I'd like to see is a data stack which caters to the technical landscape rather than marketing to the business landscape. Right? So, for example, there used to be tools like Informatica or DataStage that did well in the client server era. But in this big data era, I would love to see a single ETL tool that lends well to writing your ETL as a decoupled application and then pointing it at any compute, whether we want to run that ETL as a BigQuery process or as a Spark process or as a Hadoop process. Maybe that's too ambitious, but I don't think it's outside the realm of possibility. Like, if I express my business process as SQL, and that SQL can transform to Hive SQL or Spark SQL or BigQuery SQL or Redshift SQL, why can't we build a tool like that which is geared towards the developers?
There are a few startups in this area, but the best is yet to come. Right? So I think that is maturing, and that is a business opportunity that is begging to be taken: building something that is platform and cloud agnostic.
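As one concrete illustration of that write-once-SQL idea, the open-source sqlglot library can already transpile queries between warehouse dialects. The sketch below shows the concept under that assumption; it is not a tool discussed in the episode, and dialect coverage and fidelity vary.

```python
# Sketch of expressing business logic once as SQL and translating it to the
# dialect of whichever engine happens to run it. sqlglot is an open-source
# transpiler; treat this as the concept rather than a production pattern.
import sqlglot

business_logic = """
    SELECT product_id, AVG(price) AS avg_price
    FROM sales
    WHERE sale_date >= '2021-01-01'
    GROUP BY product_id
"""

# The same logical query, rendered for several engines.
for dialect in ("hive", "spark", "bigquery", "redshift"):
    print(f"-- {dialect}")
    print(sqlglot.transpile(business_logic, write=dialect)[0])
```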
[00:20:33] Unknown:
Yeah. That's definitely been a recurring conversation, particularly over the past month or two for me: the question of how do we manage to build, like, these database agnostic processes? You know, dbt has taken off because of the fact that it's easy for data analysts to take their SQL skills and be able to level up into more sort of repeatable workflows by pipelining these different transformations. And that's great until you get to the point of, okay, well, now I need to do the same thing on Snowflake and BigQuery. And, oh, now I'm also gonna need to run it on Redshift. And so then it's a matter of, okay, well, now I need to abstract this in Python, and, you know, maybe then I'm pulling things into a pandas dataframe, so I'm not taking advantage of all the processing capabilities of the data warehouse. And so now I'm back to the, you know, the old problem of I need to figure out what my distributed execution framework is so that I can do all the data processing out of band, versus being able to make use of the data warehouse that was supposed to be my saving grace.
[00:21:36] Unknown:
Right. And to throw in one more variable in this thing: if we want to do our data processing in near real time or real time, then some of these analytical databases don't lend well to near real time data processing. And then, if you bring in NoSQL databases like Bigtable or Cassandra or HBase, that's a whole new problem that we have to solve, because they are not very conducive to SQL. Building a SQL layer on top is clunky at best, I would say. I don't know if people have solved this problem: marrying the analytical data with the events data, storing or processing two different workloads at the same time, and then making sure they're all in sync and fit together. Right?
For the most part, what I've seen is people are replicating their data models in the analytical space to the NoSQL space, in a different format, obviously, because NoSQL has different modeling techniques. And then they're using the NoSQL for lookups, for low latency reads like API endpoints, and using the analytical side for, obviously, more heavy duty processing. But that means we're duplicating processing and duplicating models and data. And I don't see any cohesive technology right now that sort of streamlines it and makes it all in sync with each other. But this is an opportunity that's
[00:23:10] Unknown:
a big deal. Yeah. There are definitely some of these problems that are, you know, as old as data itself, and there are some problems where we seem to keep creating new ones every time we solve old ones. And I'm wondering, what are some of the sort of newly generated problems that you're currently tackling, and some of the ones that have been consistently problematic for you throughout your career?
[00:23:32] Unknown:
One of the things that we have been dealing with, maybe, for the past few years is the separation of storage and compute. I think I've talked about this earlier as well. You can definitely put all the data you have in Redshift or whatever tool you're using, but then that ties you to that vendor and that technology. And if something new comes along, like Presto or Druid or something, then you're not able to use it. And if you do use it, you're ending up duplicating your data and your processing. So one of my, sort of, pet peeves is to make sure I'm creating my data lake, or lake house or mesh or whatever you call it, in an agnostic layer like S3 or GCS and then pointing my compute at it. That is a problem I repeatedly keep on experiencing, and it's not that it's a difficult problem to solve. It becomes more a question of whether the organization wants to spend that much in resources and time to build that foundational layer on an agnostic platform and then point compute on top of it. In other words, are we ready to be strategic rather than just tactical?
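As a minimal sketch of that agnostic foundational layer, the snippet below lands data once as parquet files (a local path standing in for an S3 or GCS URI) and points two interchangeable compute engines at the same copy. The libraries and paths are illustrative assumptions, not the actual platform.

```python
# Hypothetical sketch: store data once in an open format on an agnostic
# layer, then point interchangeable compute engines at the same files.
# Local paths stand in for s3:// or gs:// URIs.
import os

import duckdb
import pandas as pd

# 1. Land the data once, as parquet, in the "lake" layer.
os.makedirs("/tmp/lake", exist_ok=True)
df = pd.DataFrame({"product_id": [1, 2], "price": [9.99, 19.99]})
df.to_parquet("/tmp/lake/products.parquet")

# 2. Point compute engine A (DuckDB) at the files: no load step, no copy.
avg = duckdb.sql(
    "SELECT AVG(price) FROM read_parquet('/tmp/lake/products.parquet')"
).fetchone()[0]
print("duckdb:", avg)

# 3. Point compute engine B (pandas here; Spark, Presto, or BigQuery
#    external tables in a real deployment) at the very same files.
print("pandas:", pd.read_parquet("/tmp/lake/products.parquet")["price"].mean())
```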
And in most companies, what I've noticed is people are more tactical, because there's a lot of pressure from the stakeholders, obviously, but there are also the planning cycles. You need to show what you accomplished for the business, more like a quarterly stock report or quarterly earnings. So you have to be continuously showing that you are doing something, and not just building these, quote unquote, pipe dreams that we will realize maybe towards the end of the year. So I would say it's less a technical problem and more organizational: how much is the organization willing to back this? A lot of these Silicon Valley organizations are decentralized.
There's not a lot of, I would say, centralized thought going into the process. Most of the teams are doing their own things. In that sense, people are taking the path of least resistance, I would say, and doing things as they see fit. The flip side is that companies like Amazon and Google have put so much premium on velocity, on doing things fast. I'm arguing that is not always right, especially in the data world, where we have to, maybe, pay our debt initially and do our due diligence and build our design and architecture and our lake houses before we can deliver a single report to the client. But you can't have both. You can't have velocity and then have it right. So we are kind of in that cycle where we are paying this debt again and again. That's, I think, my number one sort of thing that I keep battling. Secondly, what I encounter is writing ETL jobs as one-off ETL jobs. A stakeholder or client gives you some requirement; you go and build the ETL job and the table. And the end result is, your code base and your systems proliferate really fast. The complexity grows really, really fast.
Because if you have 1,000 reports, then you end up having 10,000 tables and 1,000 new jobs, which gets almost impossible to manage. Right? Especially if you made it complex using some sort of bullshit technologies like Apache Beam, which is more a collection of bugs than a technology. I don't know what they're trying to solve by doing both batch and real time in one flow. But, anyway, the problem that I keep solving is: do not write ETLs as one-off things. Don't write 1,000 ETLs. Instead, write one ETL framework which is configurable, which is metadata driven, and then define your business logic using SQL, which is easy to manage and maintain.
And then the framework spits out the jobs and does the work for you. Kind of like what machine learning does, but obviously not as smart. So don't write code; write frameworks and express intent.
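Here is a minimal sketch of that write-frameworks-not-jobs idea, under a toy setup: each "ETL" is a metadata entry whose business logic is plain SQL, and a single generic runner executes them all, so adding a pipeline means adding configuration rather than code. DuckDB and the job fields are illustrative stand-ins for whatever compute and scheduler the framework points at.

```python
# Hypothetical sketch of a metadata-driven ETL framework: jobs are entries
# in a metadata list, business logic lives in SQL, and one generic runner
# executes them all. Adding a new pipeline means adding metadata, not code.
import duckdb  # stand-in for BigQuery/Spark/whatever compute is pointed at

conn = duckdb.connect()
conn.execute(
    "CREATE TABLE orders AS "
    "SELECT DATE '2021-01-01' AS order_date, 100.0 AS amount"
)

# The metadata: every "job" is configuration, its logic expressed as SQL.
JOB_METADATA = [
    {
        "name": "daily_revenue",
        "sql": (
            "CREATE OR REPLACE TABLE daily_revenue AS "
            "SELECT order_date, SUM(amount) AS revenue "
            "FROM orders GROUP BY order_date"
        ),
        "schedule": "0 6 * * *",  # consumed by a scheduler such as Airflow
    },
]


def run_job(conn, job):
    """The single generic runner shared by every job the framework emits."""
    conn.execute(job["sql"])


for job in JOB_METADATA:
    run_job(conn, job)

print(conn.sql("SELECT * FROM daily_revenue").fetchall())
```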
[00:27:45] Unknown:
In terms of the architecture that you have been building out in your time at Wayfair, what are some of the kind of design choices that have been useful and some of the decision making that you've made around the kind of technology choices and where to invest your time and energy?
[00:28:05] Unknown:
Back in the days, especially in the analytical world, we used to expose data in a bulk format, and most of our design and architecture choices were geared towards that: to give you large amounts of data in a reasonably fast manner, whether by a table or a file or what have you. But I think the game has changed now. People want not only the analytical data, but also a bit more in a transactional manner. For example, in Wayfair's case, people want to see what was the change in the Wayfair price for a given product, and they wanna know via a service, via an API, or via a push notification. If the price of this product goes beyond $100, for example, I need to know it. Right? That's not a bulk use case. So we are now designing for two different workloads: for analytical workloads and for a bit more transactional workloads. I'm not using the word transaction in the sense that we're doing some business transaction, but more in an event manner, people want to know. Or maybe send them a Kafka message or something.
So the BigQuerys of the world don't scale well for low latency querying, and so we now have to design for both the analytical and the low latency workloads. The other thing that has sort of evolved is how we present data to the clients, and there are so many different ways to do that. The one metric that I've heard that is still true is that most people use Excel. But there is a proliferation of dashboards and reporting software. And when I first started at Wayfair, we were doing a lot of designing of these reports, doing a lot of the front GUI part of it. And that is probably not the best use of data engineering time, to design those widgets and buttons and layouts.
So one of the design paradigms that we have adopted is to move out of doing that and make our customers self-reliant. We follow a self-service BI model where we give them the analytical layer: we build a semantic layer for them, and then they can use the Lookers or Tableaus of the world to drag and drop the data they need and build the widgets and reports themselves. And for the customers who are not as savvy and who are mostly reliant on Excel, what we have done is we have built a multidimensional OLAP cube using a distributed cubing technology called AtScale.
And that works very well, because it has native support for Excel and surfaces as a pivot table, through which the analysts can do whatever they want or need. So by doing, I think, these few things, we have refocused the data engineering team on our core competencies, which are the data modeling, the design, the API pipelines, data pipelines, and data quality. I think in that way, we have moved the team away from those sort of lower value tasks.
[00:31:14] Unknown:
Today's episode is sponsored by Prophecy.io, the low code data engineering platform for the cloud. Prophecy provides an easy to use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all the data users can use software engineering best practices: Git, tests, and continuous deployment with a simple to use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control. Then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage.
Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. And another interesting aspect of the current landscape of the data ecosystem, and particularly data engineers and data platforms, is we've moved beyond just the need for answering analytical queries with our data to the point where we actually have to start managing more large scale machine learning workloads, and maybe being able to handle reinforcement learning, and being able to manage the feature engineering and low latency queries and, you know, real time aggregation of that information. And I'm curious, how are those demands starting to manifest in terms of the work that you're doing and the ways that you are thinking about steering the ship to be able to enable both the analytical use cases, which have been, you know, evolving and I'm sure are largely mature and stable, alongside the requirements that various machine learning workloads are placing on your team and your infrastructure?
[00:33:07] Unknown:
There are now so many different types of stakeholders that I don't think we should even call ourselves an analytical platform. Machine learning and data scientists are obviously the key ones. But funnily enough, the engineering teams from which we source our data, they are also our customers, because they need to consume some of the data that we are gathering and building as a historical source of truth, to do some of their data processing, or the transactional or application processing that they're doing. So that's another use case where they need data in a more transactional manner. But whoever is our customer, the underlying paradigm still remains the same: design is the king. That is what I keep telling people. Don't be reliant on the processing power of BigQuery or Redshift, but design it properly. Meaning, for example, the data scientists, they're looking for huge amounts of data to train their models.
I would say that lends really well to the data warehouses and data lakes we have, but they are looking for data in different formats, different sorts of grain, different cadences. So building a mart for them, building some sort of subject area or a gated community that they can access, I think that goes really well. But they need access to historical data, and our storage is cheap, so we keep the data for as long as they want it. Similarly, for our application teams, they need to do a lot of, I would say, searches, a lot of figuring out the data and finding the proverbial needle in the haystack.
And so another sort of design approach we're taking is to publish it in a distributed search engine like Solr, where once you throw the data in that engine, people can then use it to perform searches and cross reference the different types of data and then use whatever they need. Instead of building one-off APIs, we have defined API engines, or API services, where any query can be exposed as an API endpoint by just making configuration changes in a configuration file. So by doing those repeatable patterns, we are able to, I think, scale in a serious manner and not have to write a lot of code. If a client needs a new dataset that they want to search on, then we can just write it to Solr, and then they can run with that. And similarly with API endpoints: they can define a query, and then they have a brand new API service to grab data from. And similarly, on the data science front, again, we are doing it through frameworks and models. We define the datasets, or define the processing, in a framework.
And our data has everything, obviously. But if you need new data from us, then we define your workload as metadata. We express it as metadata, and the engine will then execute that metadata, which in turn is expressed as SQL for the most part, but there are exceptions. That then turns into jobs, and the jobs grab the data and automatically load the data marts that you need. So in that way, we are scaling the systems in a very, sort of, logical and organic way. As the customer needs and the customer base increase, we're not bogged down in writing new code. I don't have an unlimited team, so we can't necessarily scale with headcount. Right? So we are writing scalable, repeatable, performant solutions, or models, where people can just grab data programmatically.
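A minimal sketch of that configuration-driven API engine, under illustrative assumptions (Flask and DuckDB standing in for the real stack): each endpoint is a name-to-SQL mapping, so exposing a new query is a config change handled by one generic route rather than a new service.

```python
# Hypothetical sketch of an API engine where every endpoint is configuration:
# a route name mapped to a SQL query. Exposing a new dataset is a config
# change, not new code. All names here are illustrative.
import duckdb
from flask import Flask, jsonify

# The "configuration file", inlined here for the sketch.
ENDPOINT_CONFIG = {
    "top_products": (
        "SELECT product_id, SUM(amount) AS total "
        "FROM orders GROUP BY product_id ORDER BY total DESC LIMIT 10"
    ),
    "daily_orders": (
        "SELECT order_date, COUNT(*) AS n FROM orders GROUP BY order_date"
    ),
}

app = Flask(__name__)
conn = duckdb.connect()
conn.execute(
    "CREATE TABLE orders AS "
    "SELECT 1 AS product_id, DATE '2021-01-01' AS order_date, 100.0 AS amount"
)


@app.route("/api/<name>")
def serve(name):
    """One generic handler serves every configured endpoint."""
    sql = ENDPOINT_CONFIG.get(name)
    if sql is None:
        return jsonify({"error": f"unknown endpoint {name}"}), 404
    cursor = conn.execute(sql)
    cols = [desc[0] for desc in cursor.description]
    return jsonify([dict(zip(cols, row)) for row in cursor.fetchall()])


if __name__ == "__main__":
    app.run(port=8080)  # GET /api/top_products returns the query results
```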
[00:36:58] Unknown:
In terms of the kind of pitfalls that your team has run into, where maybe they didn't do enough upfront design, you know, whether in your work at Wayfair or in previous organizations, what are some of the kind of common blind spots or pitfalls that you and your team have run into?
[00:37:18] Unknown:
Yeah. So I think I mentioned this: using hardware as the first line of defense always comes back to bite you. Even in the cloud, there's only so much hardware you can throw at a problem. And not thinking through the design is another pitfall, where people are just writing code, going in as, like, an ubercoder, not thinking through the design. I may be repeating myself like a broken record, but design is the king, and not thinking through the design enough is another sort of pitfall. And then running after tools and technologies which have not established themselves, and not thinking about the support. So, for example, if I bring in a new Google service like Dataproc, which is like EMR, then that's not a fully managed cluster. I have to manage the cluster myself. Then how am I thinking about supporting the cluster? What if things go wrong on Saturday at 5 PM?
Where is my support, and who's going to resolve my problem? That means I need DevOps. I need a roster of people who are on call. So think through some of those things in terms of support and maintenance, instead of figuring things out on the fly. One other thing that I also wanna point out is that when designing solutions, especially data pipelines, people don't think about replayability. They don't think about how to replay a pipeline automatically when things go wrong. Most people I've known, they code for the best case scenario: okay, we have data coming in, it does our ETL transformation, and then we're loading it into the target table. But what happens if your data doesn't come in? What happens if your data is missing or incorrect or has gaps?
Or what if you produce a report that goes to senior management but is giving them wrong information? What happens then? What happens if you find out about the problem 3 months after it happened? How do you go back and replay? So having that automated replayability is absolutely essential, which most people, unfortunately, don't plan into the design when they're designing; a lot of experienced people do. And so what happens is they have to spend a huge amount of time manually correcting those data problems and replaying the data themselves, which is a huge sink of their time. And in that time, obviously, they're not delivering value to the stakeholder.
So the first thing that I look for when I look at a design is: what happens when shit hits the fan, and how do you recover from that?
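To illustrate the replayability point, here is a minimal sketch in which every run is keyed to a logical date and idempotently rebuilds that date's slice, so repairing a window of bad data discovered months later is just a loop over dates rather than manual surgery. Table names and the use of DuckDB are illustrative assumptions.

```python
# Hypothetical sketch of a replayable pipeline: each run idempotently
# rebuilds one day's partition (delete-then-insert), so any window of bad
# or missing data can be replayed automatically.
from datetime import date, timedelta

import duckdb

conn = duckdb.connect()
conn.execute("CREATE TABLE events (event_ts TIMESTAMP, amount DOUBLE)")
conn.execute("INSERT INTO events VALUES (TIMESTAMP '2021-01-01 10:00:00', 5.0)")
conn.execute("CREATE TABLE daily_metrics (day DATE, total DOUBLE)")


def run_pipeline(day):
    """Idempotent daily load: delete-then-insert the partition for `day`."""
    conn.execute("DELETE FROM daily_metrics WHERE day = ?", [day])
    conn.execute(
        "INSERT INTO daily_metrics "
        "SELECT CAST(event_ts AS DATE), SUM(amount) FROM events "
        "WHERE CAST(event_ts AS DATE) = ? GROUP BY 1",
        [day],
    )


def replay(start, end):
    """Repair a whole window, e.g. after discovering bad data months later."""
    d = start
    while d <= end:
        run_pipeline(d)
        d += timedelta(days=1)


replay(date(2021, 1, 1), date(2021, 1, 7))
print(conn.sql("SELECT * FROM daily_metrics ORDER BY day").fetchall())
```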
[00:39:57] Unknown:
Another interesting avenue of your experience is that you're helping to run a data engineering meetup for the community local to you in Boston. And I'm wondering, in your work with running that meetup and talking to the folks that are showing up there across the different organizations, what are some of the recurring themes that are coming up and being discussed, and some of either the common pain points or some of the interesting successes that folks have discussed?
[00:40:27] Unknown:
Right. So I would point out that I run a meetup called Data Engineering Boston, and I also write a data engineering blog on Medium, which has a good number of followers. And I'm also teaching a data analytics course at Brandeis, which is a local university in Boston. So I'm interacting with a lot of people, and it's interesting to see that a lot of the time, the problems they're trying to solve are problems that have been solved before. For example, most of the workloads that people have, they're not living in the Redshift or BigQuery world. They're mostly living in SQL Server, Postgres, or Aurora.
Those are the most common databases people are using. So they are having problems in terms of concurrency or performance or scaling. Simple things like: you have a parquet file, how do you do schema evolution? There is a schema evolution feature, but it is expensive; it slows things down. So how do you solve it by design? For the most part, people are having these sorts of common problems. And I feel like there is a book that needs to be written, that's begging to be written, maybe someday if I have time, that explains those kinds of things, the simple problems that people are having.
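To ground the parquet schema evolution example, here is a small PySpark sketch of the built-in (but costly) approach: mergeSchema reconciles differing file schemas at read time by inspecting every file's footer, which is exactly why it slows down on large lakes and why solving it by design, with one authoritative schema, is usually cheaper. Paths and columns are illustrative.

```python
# Sketch of the parquet schema-evolution problem: files written over time
# carry different schemas, and Spark's mergeSchema option reconciles them
# at read time -- convenient, but it must inspect every file's footer,
# which is what makes it expensive on large lakes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# Day 1 files had (id, price); day 2 files added a `currency` column.
spark.createDataFrame([(1, 9.99)], ["id", "price"]) \
     .write.mode("overwrite").parquet("/tmp/products/day=1")
spark.createDataFrame([(2, 19.99, "USD")], ["id", "price", "currency"]) \
     .write.mode("overwrite").parquet("/tmp/products/day=2")

# The built-in feature: merge the schemas across all files at read time.
merged = spark.read.option("mergeSchema", "true").parquet("/tmp/products")
merged.printSchema()  # id, price, currency (null where the column was absent)
merged.show()
```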
And I think they're looking for a support community. There are some people who are looking to break into this field, but most people are looking for support, because there are so many technologies, tools, and platforms that it's virtually impossible to be good at everything. So they're looking for expertise when they're struggling, and they're looking to unblock themselves, and they're looking to get ahead in their careers. And they're trying to figure out, oh, I'm a data analyst; how do I become a data engineer? Or how do I become a data scientist?
So they're also looking to see how they can progress their careers: what are some of the courses or technologies or certifications they can do for that. So there are different kinds of people out there, obviously, but it's very interesting. I was in my bubble; I thought the analytical databases were probably the most common, but they're not. It's the SQL Servers and the Auroras and the MySQL servers that are obviously more commonly used in the data world. Yeah. It's definitely interesting, the
[00:42:54] Unknown:
stickiness that these transactional systems have. And I think part of it is that, operationally, they're very well understood. So if you have an application team who can stand up a database and use that for writing a, you know, general CRUD application, they're gonna be able to do the same thing and just use it for their analytical workloads, until they start to crush the database because they're trying to do aggregate queries on row oriented storage. So
[00:43:23] Unknown:
Even then, I would say these databases have done a good job. At least, like, if you have low terabytes of data, right, they can easily scale. Like, SQL Server, Oracle, or Postgres, they can easily scale up to low terabytes. And for most companies, I would say, even though big data is a buzzword, most companies don't have hundreds of terabytes. So these databases can easily handle that. And even though they are OLTP, they can easily do reporting, aggregation, and all those things; you can easily solve that by design. And, secondly, there is also, like, a sweet spot in between, where somebody like AWS Aurora has come up and has offered some of those features, where they offer you distributed compute and separated storage and compute, but they also offer you ACID and they also handle all the OLTP workflows. So they're offering kind of the best of both worlds.
And they can scale to, I don't know, 40, 50, 60 terabytes. So then why do you even need the Redshifts of the world and pay the high cost? Right? Because if you have a mixed workload, then Aurora can easily solve the problem. Why go for a BMW when a Honda Civic will do the job? Right?
[00:44:37] Unknown:
Because driving a BMW is fun. But you have to pay for it. Exactly. Yeah. There's definitely a lot of the kind of I don't know if it's the sort of fear of missing out or just, you know, cargo culting of, oh, I'm doing analytics. So that means that I automatically need to buy Snowflake for my, you know, 5 gigabytes worth of data.
[00:45:01] Unknown:
It's the age old problem of crushing a fly with a bazooka. Oh, yeah. By the way, Snowflake has the best business model, where they don't tell you why they are spinning up clusters, but they'll spin up new clusters all the time for your data load workload. And the more clusters they spin up, the more money they make. Right? So I should have bought the stock.
[00:45:28] Unknown:
Why are they launching a new cluster? Oh, because they, you know, they need to pay their employees stock options or what have you. We should have all bought those stocks. Right? So Yeah. Absolutely. And so in your experience of working in the analytics space and managing a data engineering team, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:45:52] Unknown:
So I think as you grow in your career and as you grow sort of in your team and organization, you realize that most of the technical problems are relatively easier to solve. And maybe I'm giving you a little bit of a political answer here, but most of the problems that are hard nuts to crack, so to speak, are political problems, organizational problems, that require maybe a lot more weight than you can bring to bear. So, for example, in a decentralized company like Wayfair, there is no such thing as an office of architecture. Right? So nobody's thinking through data architecture. All the teams, internally, are doing what they can, but there is not enough weight being put on data architecture as a discipline.
So as a result, all the decentralized data teams are doing whatever they can. They're doing the best they can, obviously, but there's no central competency, and there's not much gravity there. So I feel like we have missed a step there by not thinking that through. But, obviously, if we do that, then we slow the teams down. If you have an office of data architecture, then the office is mandating a few things, and that slows people down. That slows teams down. And as I said, Silicon Valley companies have put a lot of premium on velocity.
So it's kind of chicken or egg. I still feel like there is maybe a happy medium there, where we can still have some people thinking about data architecture while not slowing things down. And that would give people benefit in the long run.
[00:47:43] Unknown:
As you continue to work through the challenges that you're facing and chart the course for your team and your organization over the coming, you know, months, quarters, and years, what are some of the topic areas that you're most interested to dig into or some of the
[00:48:00] Unknown:
sort of aspects of the data ecosystem that you're keeping a close eye on? Yeah. No. I think there are a lot of exciting things that are happening, obviously. So one thing that we are looking at is how do we break some of those barriers between machine learning and databases. Right now, we have to build a separate, I would say, data structure, or a data mart or what have you, a semantic or curated layer, for our machine learning folks or data scientists, and then they sort of point the models on top of that. So having your model point right at where your analytical data lives, and make that sort of selection within that tool, would reduce the barrier: having all those algorithms natively talk to the database.
I think that will be huge. Secondly, I'm also looking at the OLAP space, which used to be big back in the days, if you remember the SSAS cubes from Microsoft. In the big data world, that technology is maturing; I think there are a few industry players. There is Apache Kylin, which is in the Hadoop space, and AtScale, which is more in the distributed technology space. So it will be huge to have that technology mature and give people a good multidimensional data structure to play with, especially in the big data world, given we have so much data proliferating. So having that technology would be good. And then having a solid technology which can run both the analytical and real time workloads together.
I'm not talking about transactional, but near real time loads. That is, again, something that I'm keeping an eye on. And, obviously, beyond Spark and Hadoop and Python, having, like, a next sort of ETL technology or paradigm which is more agnostic of platforms and clouds and technologies; I'm looking at that also. And then, finally, I would say data quality. I feel like data quality is kind of the stepchild of data engineering. People haven't focused on it as much, but we spend so much time worrying about data quality that it's not productive to not do it upfront.
And it's mostly a build; I can't buy anything there right now. So those are some of the things that I'll be looking at.
[00:50:33] Unknown:
So for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:48] Unknown:
So I think I mentioned some of them: the data quality and ETL frameworks, and machine learning talking to databases natively, and all those things. So, obviously, technology is evolving as we go. Building much more scalable, much more reliable, durable, performant systems is as much fun now as it used to be back in the Oracle 7 days. I remember my first data warehouse was 700 megabytes. I used to think that's huge, and it's almost funny to see the world right now.
[00:51:25] Unknown:
Yeah. Absolutely. Alright. Well, thank you very much for taking the time today to join me and share your experiences of working in the space and managing a team that's powering a large organization. It's definitely a very interesting and constantly evolving problem domain. So it's great to hear the experiences of people like you who are working in it day to day. So thank you again for taking the time and for all of your efforts, and I hope you enjoy the rest of your day. Thank you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Ashish's Journey in Data Engineering
Challenges in Managing Data Engineering Teams
Strategies for Team Motivation and Development
Build vs Buy in Data Engineering
Separation of Storage and Compute
Design Choices and Architecture at Wayfair
Managing Machine Learning Workloads
Common Pitfalls in Data Engineering
Community Insights from Data Engineering Meetups
Interesting Lessons in Data Engineering
Future Trends and Focus Areas in Data Engineering