Building And Managing Data Teams And Data Platforms In Large Organizations With Ashish Mrig

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With our managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Packaderm, and Dagster.

With simple pricing, fast networking,

object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to data engineering podcast.com/linode

today. That's l I n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted.

Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders.

With complete API access, a user friendly interface,

and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data.

Go to dataengineeringpodcast.com/bigeye

today to sign up and start trusting your analyses.

Your host is Tobias Macy, and today I'm interviewing Ashish Marig about his path as a data engineer. So, Ashish, can you start by introducing yourself? Yeah. Hi, Tobias. Thanks for having me on the show. So hello, everyone. My name is Ashish, and I'm a data engineering manager at Wayfair.

So I started my data engineering journey about,

I would say, 20 years ago

when

they used to call it database development

and when Oracle 7 was probably the cool thing on the block.

So I kind of accidentally

got into this and

sort of started my career as a senior developer

architect and got into big data.

And for the last 8 plus years, I've been

managing

data engineering teams. Mostly, we're doing cloud first deployments,

working with AWS and GCP. And I would say I work with pretty much all different types of database. So like RDMS, MPP,

distributed, or Hadoop or NoSQL. So, obviously, there are probably hundreds of databases. But in a category wise, I'm a little bit most of them. And,

I feel like I have a good grasp on how things are aligned

in the data engineering space right now.

You mentioned that you've been working in data management for

quite some time now. I'm wondering if you remember how you first got involved in the space and what it is about sort of the data ecosystem and the problems that are involved that keep you interested and motivated to

stay working in this area?

So I think it's more sort of organic

evolution. As I

got into the seniority,

I found myself helping other team members and

and it organically grew into a team leader and then as a data manager. So I don't know whether this this was, like, a pre designed or predestined thing on my part.

So my first big break was in a company called

TiVo.

For those of you who are

younger, who may not know that TiVo used to be the cool digital video recording box in nineties.

They were still around about 10 years ago and then I joined them. We were in the business of monetizing

TV viewership data. So we would get anything you do on television set up as we'll get that data and then we would

aggregate it and then make it into product that we're selling to advertisers and to use to do. So I think that was my first big break where I was managing the big data team. That was the scale of the data was huge,

and we were dealing with terabytes.

And that was the first sort of AWS cloud deployment, you know, working with Spark and Hadoop and all the distributed technologies.

So I think that's how I got into the managing

the data teams. And I found this was a very rewarding experience. I could bring a lot of experience, a lot of

my sort of working in the benches knowledge into the

table and help teams and

companies build a scalable and reliable architecture that is probably harder to do in data than other space.

So as you mentioned, you're now leading a data engineering team at Wayfair, which is a fairly sizable company. And I'm sure you have a pretty significant volume of data. And, you know, you're probably hitting the maximum on all 3 of the v's of variety, velocity, and volume.

And I'm wondering what are some of the kind of main topics that take up a lot of your time and attention in terms of being able to keep your team on track and be able to

build data products and build systems that are able to manage the sort of scale of information that you're working with?

So

as a manager of the engineering team at Wayfair,

1 of my job is to look

ahead. Meaning, I'm looking

2, 3, 4 quarters ahead of the team. My team is working on

the current road map and delivering day to day things that people need. But I'm looking ahead and trying to not only figure out what problems we'll have, but also solve those problems

in advance.

And that's obviously not a easy job to do. So the thing that I

typically tell people is

think of the data engineering as a big lumbering

ship.

You can change the direction, but it takes

long time to change your direction. It takes a lot of effort to change your direction. So once the ship is set on a course, we're

moving on that course and and it takes a lot of time and effort to change and chart a new direction. So we have we have to be careful

and plan our road map before we start working on it. So this is maybe different than application development where people are

doing things fast and failing. Right? I can't

start working on, let's say, Cassandra

and do things incorrectly and then fail and then come back and do things differently. Because

I need to make sure I provision the right amount of nodes, the right compression, the right compaction,

the right sort of algorithms,

and process the design data models in place before I can even write 1 piece of code. So we're thinking about

front loading the project or the initiative with

a lot of thinking into the design,

How the data should be organized

and what are the tools and processes that will go into

moving the data. So once we build that process,

changing it or managing it as a big task. So that's where the ship analogy comes in.

So my

big thing is to make sure

that stage is set for the team to

come in in next quarter or the following quarter, do the things they want to do, or they are required to do, and not be bogged down by some of their details. So I kind of and the

what is the word they use? I'm kind of the

forward team or the and last team that is kind of doing the reiki and setting the stage for the team to come in and execute.

So in continuing with the ship analogy, as you're trying to kind of captain the ship and chart a course, what are some of the kind of icebergs that have tended to pop up in your way that you need to work around and be able to help kind of steer in advance of so that your team doesn't run into a sort of catastrophic situation?

As a data engineer,

we are

almost always traveling to different worlds.

The first word is we are engaging the stakeholder and

giving them what they need. Right? Another day, we are solving a business problem. Right? We exist because there is a business

that is being run and their business needs data from us. So that is our primary

goal and need. And the second leg that we are standing is the technology work, and I don't wanna tell you this, especially in the big data land, the amount of technologies and services that are proliferated

is humongous. So making sure we are setting the sales course for the right technology design architecture

is critical. Otherwise,

if you get bogged down in in the second, sort of, we're not delivering on the first 1. Right? So that goes hand in hand in glove. 1 example that I'll give you and the iceberg that I always try to avoid is

the so called data migration.

So

we went I think almost all companies did this. All the companies, maybe nineties,

early 2000, we were all on prem.

Starting 2010

or maybe around that time. We all said we will go to AWS

or Azure or in cloud. So we have spent enormous time going to cloud.

And now we are in the cloud, but we are also

kind of stuck

with 1 application, 1 vendor. So in our case, we are probably not stuck, but we are using BigQuery.

But tomorrow, let's say,

a new technology comes around, which is

15 times better than BigQuery, then

we're looking into 1 more data migration. And data migrations are time consuming things and they're hard. And they're even harder to undo in cloud, especially if you're moving vendors.

So I spent

6 months doing data migration from on prem to

Google Cloud,

where we do not deliver a lot of value for the stakeholder. So I don't wanna do that again, and that's not good for the business. So we need to think about

future proofing our architecture in a way that protects us from any new technology that or less embracing new technology without doing all the cost we do. So that is 1 iceberg

that I always

try to look out.

But

the flip side of it becomes a political

issue and the sense that why are you trying to solve a problem that doesn't exist yet? BigQuery works fine. So and there's no nuclear on the block. So who knows? So we will do BigQuery

foreseeable future. So that is, I think, the mind field and the iceberg that we have to, as a manager or leaders, we have to sort of straddle and

and work through and make sure that we are not hitting to them and continue to team with us.

The main thing too with the cloud adoption is that, well, it does give easy scalability of, oh, I can, you know, pay for what I need right now and just scale up as I need to with when you do actually start to hit that scale up point, then the costs somehow seem to manage to scale superlinearly.

So

you need to figure out, okay. How much data am I actually going to need to use down the road? What are my query patterns going to be? How am I going to mitigate some of these, you know, unforeseen

expenses that come about because of the fact that the cloud is so dynamic?

I think you hit the nail on the head. So back in the days when we were on prem Oracle shop,

we used to spend so much time

thinking about the design architecture and tuning our queries because we have 1 box, 1 server, which is fixed on memory. We can't scale scale it. It takes 2 months to get a new 1. And it takes

almost act of God

to ask for patching and provisioning and all those things. But now, everything is on demand. So you're right. I've seen,

especially in the new

sort of I don't wanna make it generational wall, but new generation of, you know, engineers that the first line of defense is, oh, let's throw more hard disk or let's throw more memory at the problem. Instead of tuning your queries or thinking about the design or solving the problem with the design,

people are solving it by throwing more resources, more compute at the problem

and which

invariably

ends in only 1 way and that way is

with an email from your

head of the infrastructure or whoever is managing resources that your enforcement is out of

the roof and you need to validate.

Until somebody

puts

guardrail,

I think it's a natural human tendency that we

take the path of least resistance. And

that is another iceberg, I think, you in the technology that we, as a manager,

always kind of asking people and coaching people and guiding engineers to make sure that we do not grow in order of code

design, architecture, and the hardware is the last resort and not the other way around.

But, again, that also becomes,

political potato where

people lead things fast.

And and doing it the right way is time consuming. So it's a matter

of finding those battles and explaining those.

But fortunately,

Longyearfield is a rich company, so we can afford all those hardware that Google

is giving us.

From the team management perspective,

because you're working with all these data professionals, you have these,

you know, high impact projects that can make or break the business.

What are some of the useful

lessons and strategies that you've gathered in your time working at Wayfair and other companies to help keep your

engineers motivated and on target. And because of the,

you know, continually

changing nature of the ecosystem, helping them to understand

what are the valuable ways to spend their time, like, what are the lessons that are worth learning with these new technologies, and how do you identify the cases where it's actually just a flash in the pan and it's not actually bringing anything new? It's just, you know, putting a new coat of paint on something that's been around for decades.

Yeah. That's a really good question. And I think it's a problem that every engineering manager, whether data or

any different type of software engineering

discipline has to or has been experiencing.

So

engineers,

by nature, are a

very opinionated bunch

and very high maintenance.

So I think 1 of the key thing is we need to

make sure that we

are worrying about their development sorry, career development

and making sure that they are thinking in the right direction.

I feel like that gives them lot of job satisfaction.

So what I tell people is don't run after the next kid on the block, but have your basics.

Solidify your basics. What I mean by that is

understand

the core concepts of data warehousing,

data modeling,

and big data. They are still relevant whether you're using Redshift

or BigQuery or Synapse or whatever else.

So having that core knowledge and understanding of arranging data and what are the pitfalls or what are the in terms of understanding what makes or

breaks a data project. I think that is what it's where in gold.

But at the same time, thinking about

their career development, giving them a chance or giving them room to work

in latest set of technologies,

and making sure they're

marrying that chance with solving a business problem.

Again, what is where the goal 1 example that I'll give you is as a part of credit open in 1 of my employees, I asked them to work on building a data quality framework. As you can imagine, the data pipelines, which are managing data,

have personally, a lot of data quality problems. And we don't want to find out retroactively

after

even has passed from the client. So we do have this data quality framework design

using, I would say, a very innovative metadata driven data model

using Python

to

build a data model where anybody can go in and define their data quality checks and then

log that checks into some sort of tables and then point

dashboards

on top of that for easy consumption. So that whole project is a very ambitious thing, but it gives people a chance to flex the data modeling muscle,

work in Python packages, and

work with something called InfluxDB

or for event logging

and build the dashboards. But at the same time, we are also solving a business problem where

the customers are now able to see why something went wrong

or what was the extent of going wrong by doing the data quality measurements

proactively on the data pipeline. As the data pipeline gets published, we also measure those metrics.

And in some cases, we are even stopping the data pipeline because we define thresholds

that if the data crosses this threshold,

then let's not even publish that data. Right?

So that, I think, is a very satisfactory, sort of, project that people

like to do or they especially get engineers because that not only solves their business problem, but also

expands their technical horizon

and keeps them current and keeps them, sort of, happy with the with the work.

As you're talking about the, kind of, data quality

work and the challenges there and building something in house, it also brings up the question of,

as an engineering manager and as somebody who does have a lot of background in the data ecosystem, I'm curious

what your thoughts are on the

modern data stack as it were, and what are the actual

useful pieces to pull out of that? What are the pieces that are not worth spending the time as a relatively mature organization on trying to kind of rearchitect

around because you run the risk of, you know, dying a death of a 1, 000 cuts from all of the different SaaS bills and, you know, some of the sort of cost management aspects that go into, you know, buying into all these different managed systems and, like, the kind of build versus buy equation?

Yeah. I think there are 2 dimensions to this question. 1 is

the build versus buy

in general for data engineering. Secondly,

for a big company like Wayfair. So let me address the second 1 first. I think Wayfair, similar to Amazon, has,

for the most part, adopted the model that if you can build it, then why buy it? Right? I think Amazon has taken to extreme because they can. They're big. They don't use any commercial product because it doesn't scale for them. But, also, obviously, we are not as big.

We have things that industry leaders

like Jira

or

Git or, obviously, Google

Cloud Platform Services,

BigQuery's other word. Whoever is the industry leader, I think we are not

reinventing

the wheel there. But

to your point, in terms of

the current data stack, I think,

it

lends well to solving

big data problem.

But it doesn't lend well to solving

the process problem. What I mean is

most of the products that are being marketed right now are marketed as a way you don't have to run a single piece of code, like Snowflake or

these new cubes or

new Apache

sort of products.

But

we do want to encode because we are an engineering organization. So we are not afraid of any code. A data stack which lends to the technology

technical landscape landscape than

marketing to the business landscape. Right? So, for example,

there used to be 2, like, Informatica or DataStage

used to, do well in the client server era. But in this big era, I would see a single ETL tool that lends well

to

writing your ETL

as

a decoupled application and then point it to any computer, whether we want to run that ETL as BigQuery process

or as a Spark process

or as a Huddl process. Maybe that's too ambitious, but I don't think that's, also the random possibility. Like, if I express my business

process as a SQL, and that SQL can transform to hide SQL or Spark SQL or Ubiquet SQL or Redshift SQL. Why can't we build a tool like that which is geared towards

the developers?

They're having few startups in this era,

but

we're yet to see the best to come. Right? So I think that is maturing and that is a business opportunity that is begging to be taken.

Building something that is platform and

Cloud agnostic.

Yeah. That's definitely been a recurring conversation,

particularly

over the past month or 2 for me is the

kind of question of how do we manage to build,

like,

these database agnostic

processes where,

you know, DBT has taken off because of the fact that it's easy for data analysts to take their SQL skills and be able to level up into

more sort of repeatable workflows by pipelining these different transformations.

And that's great until you get to the point of, okay. Well, now I need to do the same thing on Snowflake and BigQuery. And, oh, now I'm also gonna need to run it on Redshift. And so then it's a matter of, okay. Well, now I need to abstract this in Python and,

you know, maybe then I'm pulling things into Panda's data frame so I'm not taking advantage of all the processing capabilities of the data warehouse. And

so now I'm back into the, you know, the old problem of I need to figure out what my distributed execution framework is so that I can do all those data processing out of band versus being able to make use of the data warehouse that was supposed to be my saving grace.

Right. And and to throw in 1 more variable in this thing is

if we want to do our data processing in real near real time or real time,

then some of these analytical databases don't lend well to near real time data processing. And then we're talking about

if you bring in NoSQL databases like Bigtable

or Cassandra or

Edgebase,

then that's a

a whole new problem that we have to solve because they are

not very conducive to SQL.

Building a SQL layer on top

is clunky at the best, I would say. I don't know if people have solved this problem.

Marrying the analytical data with the

events

data and storing 2 different,

workloads

at the same or processing 2 different workloads at the same time and then make sure

they're all in synergies fit together. Right?

For most part, what I've seen is people are replicating

their data models in analytical space to

the NoSQL space

in a different, obviously,

format because NoSQL has a different modeling techniques.

And then using the NoSQL for the lookups

for a low latency

reads like API endpoints.

And then using analytical for, obviously, more heavy duty processing.

But then that means we're picking up processing and picking up models and different data.

And I don't see any

cohesive

technology right now that sort of streamlines it and makes it all in sync with each other. But this is an opportunity that's

gone a big deal. Yeah. There are definitely some of these problems that are,

you know, as old as data itself, and there are some problems that we seem to keep creating new ones every time we solve old ones. And I'm wondering

what are some of the

sort of newly generated problems that you're currently tackling and some of the ones that have been consistently problematic for you throughout your career?

1 of the things that we, maybe, for past few years is

separation of storage and compute.

I think I've talked about this earlier as well. You

can definitely put every data you have in, in Redshift

or whatever tool that you're using, but then that ties you with that vendor and that technology. And

and if something new comes along, then, like, Presto or Druid or something, then you're not able to use that. And then if you do use it, you're ending ending up duplicating

your data

and your processing.

So 1 of my, sort of, pet peeve is to make sure I'm

creating my data lake, or they call it as lake house or a mesh or whatever you call it,

in

a agnostic

layer like s 3 or GCS

and then point my computer. So that is problem that repeatedly keep on

experiencing and it's not that it's a difficult problem to solve. It becomes more

of the organization wants to spend that much resources and time

to build that foundational

layer

in an agnostic platform and then appoint computers on top of it. In other words, are we ready to

be strategic and tactical?

And in most companies, what I've noticed is people are more tactical because

there's a lot of pressure from the stakeholders, obviously, but they also

do cycle land. You need to show what you accomplished

for the business. More like

a quarterly

stock report

or quarterly earnings.

So

you have to be continuously showing that you are doing something and not just

building this, quote unquote,

pipe dreams

that we will realize it maybe towards end of the year.

So I would say less a technical problem, but more organizational. But how much organization is willing to back this?

A lot of these Silicon Valley organizations are decentralized.

There's not a lot of, I would say, centralized thought going into the process. All the most of the teams are doing their own things.

In that

sense, people are taking the path of least resistance, I would say,

and then doing things as they see fit. So

but the flip side is that

by doing that, like, companies like Amazon, Google, they have put so much premium on velocity.

Doing things fast.

I'm arguing that is not always the right, especially in the data world, where we have to, maybe,

pay our debt initially and

do our due diligence and build our

design and architecture and our lake houses

before we can deliver a single

report to the client. But you can't have them. You can't have

velocity

and then have the right item. So we are kind of in that cycle where we are doing paying this debt again and again. So that's, I think, my number 1, sort of, thing that I keep doubting. Secondly, what I encounter is

writing

ETA jobs, 1 of ETA jobs. So if a stakeholder client gives you

some requirement,

you go and build the ETA job in table. And what the end result is,

your code base and your

systems probably fit really fast. The complexity goes really, really fast.

Because

if you have 1, 000 reports, then you end up having 10, 000 tables and 1, 000 new jobs, which is which gets almost impossible to manage. Right? Especially, if you made it complex using some sort of bullshit technologies

like Apache Beam, which is

more a collection of bugs than a technology.

So I am

they are I don't know what they're trying to solve by using both batch and real time in 1 flow.

But, anyway so my, sort of, problem that I keep solving is

do not write ETLs as 1 of things. Don't write 1, 000 ETLs. Instead, write 1 ETL framework

which

should be configurable, which should be matter driven, and then define your business logic using

SQL, which are easy to manage and maintain.

And then the framework then spits out the jobs and

does the work for you. So

kind of like what machine learning does, but obviously not as smart.

So don't write code, but write frameworks

and build in intents.

In terms of the architecture

that you have been building out in your time at Wayfair, what are some of the

kind of design choices that have been useful and some of the decision making that you've made around the kind of

technology choices and where to invest your time and energy?

Back in the days, we used to especially in the analytical world, we used to

expose data in

a bulk format.

And most of our design architecture choices are geared towards that to give you large amount of data

reasonably fast

manner. So whether by table or a file

or what have you. But I think the the game has changed now. So

people want not only

the analytical data, but also

a bit more in transaction banners. So people want to see how much, for example, in Wayfair case, how

what was the change in the Wayfair price for a given product? And they wanna know

why a service, why an API, or why a push notification. If the price of this product

goes beyond $100, for example,

I need to know it. Right? So that's not a bulk use case. So we are not designing for 2 different workloads,

for analytical workloads and for

bit more transactional workloads. I'm not using the word transaction in that sense. We're we're doing some business transaction, but more in a event manner, people want to know. Or maybe send them a Kafka message or something.

So the big ways of the world don't scale well for low latency querying.

And so we have to now design both

the analytical and the

low latency workloads.

The other thing that has sort of evolved is

how do we present data to the clients?

And there are so many different ways to do that. Only 1 metric that I heard

is far still true is most people use Excel.

But there are proliferation of dashboards and reporting software.

And when I first started Wayfair, we were doing a lot of designing of these reports,

doing the lot of front GUI part of it. And that is probably not the best

use of data engineering time to design those

widgets and buttons

and layouts.

So 1 of the design paradigm that we have adopted is to move out of

doing that and make our customers self self reliant. So we follow self-service BI model

where we give them the analytical layer

or semantic. We build a semantic layer for them and then they can use

the Looker or Tableau, so the word,

to

drag and drop the data they need and build

the widgets and and reports themselves.

And for the customers who are not as savvy and who are mostly reliant on

Excel, we have what we have done is we have built

a multidimensional

OLAP cube

using a distributed cubing technology called AtScale.

And that works very well because that has a native support for Excel and then clusters into pivot table and through which the tablets can do whatever they want or need.

So by doing, I think, these few things, we have

refocused the data engineering teams to our core competencies, which is the data modeling, the design, the API pipelines, data pipelines, data quality.

So I think in that way, we have moved the team away from those sort of low performing tasks.

Today's episode is sponsored by prophecy dot io, the low code data engineering platform for the cloud.

Prophecy provides an easy to use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow.

Now all the data users can use software engineering best practices.

Git tests and continuous deployment with a simple to use visual designer.

How does it work? You visually design the pipelines and prophecy generates clean Spark code with tests and stores it in version control. Then you visually schedule these pipelines on Airflow.

You can observe your pipelines with built in metadata search and column level lineage.

Finally, if you have existing workflows in AB Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into prophecy, making them run productively on Spark. Learn more at data engineering podcast.com/prophecy.

And another interesting

aspect of the current landscape

of the data ecosystem

and particularly data engineers and data platforms

is

we've moved beyond just the need for

answering analytical queries with our data to the point where we actually have to start managing

more

large scale machine learning workloads

and maybe being able to handle reinforcement learning and being able to manage the feature engineering and low latency queries and, you know, real time aggregation of that information. And I'm curious,

how are those demands

starting to manifest in terms of the work that you're doing and the ways that you are thinking about

steering the ship to be able to enable both the analytical use cases that have been, you know, revolving and I'm sure are largely mature and stable

alongside

the requirements that various machine learning workloads are placing

on your team and your infrastructure?

There are not so many different types of stakeholders that I don't think we should even call ourselves an analytical platform.

So machine learning and data scientists are obviously the key ones. But funnily enough, the engineering teams

to which we source our data,

they are also our customers because

they need to also consume some of our data that we are

gathering and building, as a historical source of truth

to do some of their data processing or, transactional processing, application processing that they're doing. So that's also another use case where they we need data in more transactional manner. Well, whoever is our customer, the underlying

paradigm still remains the same. That design is the king. That is what I keep telling people. Don't be reliant on processing power of BigQuery or Russia, but

design it properly. Meaning,

for example, data scientist, they're looking for,

huge amount of data to train their models.

I would say that it's really well to the data warehouse and data lakes we have, but they are looking for

data in different formats, different sort of grain, different

cadences.

So building

a mod for them, building some sort of subject area or a gated community

that they can access it. I think that goes really well, but they need

access to historical data. And our storage is cheap, so we we keep the data for how long they want it.

Similarly,

we for our application

teams,

they need to do a lot of,

I would say,

searches and a lot of figuring out the data and also probable needle in the haystack.

And so

another sort of design approach you're taking is to publish it in a

distributed search engine like Solr,

where once you throw the data in that engine,

people can then

use it to perform searches and do

and cross reference the different types of data and then use whatever they need.

Instead of building APIs, we have defined

API

engines

or API services where any query can be exposed as a API endpoint

by

just doing configuration changes,

in a configuration file. So by doing those repeatable

patterns,

we are able to, I think, scale in a serious manner

and not have to write a lot of code. If

a client needs a new

dataset that they want to search on, then we can

just try it to solve and then make a run from that. And similarly, API endpoint, you can define new API endpoint. They can define query, and then they have a valid new API service to grab data from. And similarly,

on the

data science front,

again, we are doing through frameworks and models. We

define

the datasets or define the processing

in a framework.

And if you need new

data from us,

our data that has everything, obviously. But if you need new data from us, then we define your load

as a metadata. We express as a metadata, and the engine will then execute that metadata,

which in turn is expressed as SQL for the most part, but there are exceptions.

And then transfer into jobs and then the jobs and grab the data and automatically

load your data parts that you need. So in that way, we are scaling the systems in a very, sort of, logical and organic way, Vian,

As the customer needs

and the customer

base increases, we're not

bogged down in writing new code. I don't have a team, so can't necessarily as the mother had mentioned. Right? So we are writing the scalable, repeatable,

performant

solutions or models where people can just grab data

programmatically.

In terms of the

kind of pitfalls

that your team has run into where maybe they didn't do enough upfront design or,

you know, whether it's in your work at Wayfair or in previous organizations?

What are some of the kind of common blind spots or pitfalls that you and your team have run into?

Yeah. So I think I measured this using print hardware as the first line of defense that always comes to bite you back.

Even in cloud, there's so much hardware you can throw,

and not thinking through their design is another pitfall where people are

just

writing code or going in as, like, a Ubercoder,

not thinking through

the design. I may be repeating myself as a program record, but

design is sticking and

and not thinking so design enough is another sort of pitfall.

And then going running after tools and technologies which are not established themselves and not thinking about the support. So, for example, if I bring in a new

Google service called Dataflow, which is like EMR,

then that's not a managed cluster. I have to manage the cluster myself. Then how am I thinking about supporting the cluster? What if things go wrong on Saturday at 5 PM?

Where is my support? And who's

going to resolve my problem? That means I need DevOps.

I need a roster of people who are on call.

So thinking through some of those things that in terms of support and maintenance

and figuring things out on the fly. 1 other thing that I also wanna point out is that

when designing solution, especially data pipelines, people don't think about

real inability. They don't think about how to play a pipeline automatically when things go wrong. Most people I've

known, they code for the best case scenario. Okay. We have data coming in and

does our ETL transformation and then unloading it in target table. But what happens if your data doesn't come in? What happens if your data is missing or incorrect or has gaps?

Or what if you produce

a report that goes to senior management but is giving them wrong information? What happens

then? What happens if you find out the problem 3 months after it happened? How do you go back and repeat?

So having that automated

replayability

is absolutely essential,

which most people I know, they don't, unfortunately,

a lot of people do, experienced people. But most of the people, they don't

plug into the design when they're designing. And so what happening is they

would have to spend

huge amount of time manually

to

correct those

data problems and repeat the data themselves,

which is a huge sync on all the time. And that time, obviously, they're not delivering value to the stakeholder.

So the first thing that I look forward when I look at the design is what happens

when shit hits the ceiling or ceiling

and

how do you recover from that?

Another interesting

avenue of your experience is that you're helping to run a data engineering meetup for the community local to you in Boston. And I'm wondering,

in your work with

running that meetup and talking to the folks that are showing up there across across the different organizations? What are some of the recurring themes that are coming up and being discussed and some of the either

common pain points or some of the interesting

successes that folks have discussed.

Right. So I would also point out that I run a meetup called Data Engineering Boston, and I also write a blog

on meet Data Engineering blog on Medium, which has good number of followers.

And I'm also teaching a data analytics course

at Brandeis, which is a local university in Boston.

So I'm interacting with a lot of people,

and it's interesting to see that a lot of the times, the problems they're trying to solve are

problems that they haven't solved

before. So,

for example, most of the workload

that people have, they're not living in Red Shift or BigQuery. So the world, they're mostly they're living in,

SQL Server, Postgres, or Aurora.

That's the most common data this people are using. So they are having problems in terms of concurrency

or performance

or scaling,

figuring out Things like simple things like, you have a packet file. How do you do schema evolution? But there's a schema evolution feature, but that is expensive. That slows things out. So how do you solve it by design? So then, for the most part, people are

having this sort of common problems. And I feel like

there is a book

that needs to be written.

It's begging to be written, maybe someday, if I have time,

that

kind of explains those kind of things or simple problems that people are having.

And I think they're looking for a support community.

So there are some people who are looking to break into this field. But for the most people, they're looking to

have support

because there's so many technologies, tools, and

platforms. It's just virtually,

obviously, impossible to be good at everything. So they're looking for

expertise

when they're struggling, and they're looking to unblock themselves, and they're looking to get ahead in the career.

And they're trying to also figure out, oh, I'm a data analyst. How do I become

a data engineer? Or how do I become a data scientist?

So they're looking to also see how

well they can progress the career. What are the some other courses or technologies or certification they can do

to do that. So there are different kind of people out there, obviously, but it's very interesting.

At least to me,

I was shocked to find

in my bubble. I thought the analytical databases are probably the most good, but they're not. It's the SQL Server and the Aloraz and and MySQL server that are obviously more

commonly used on the data world. Yeah. It's definitely interesting the

stickiness that these transactional

systems have. And I think part of it is that operationally,

they're very well understood. So if you have an application team who can stand up a database and use that for writing a, you know, general CRUD application,

they're gonna be able to do the same thing and just use it for their analytical workloads until they start to crush the database because they're trying to do aggregate queries on a row level storage. So

Even then, I would say these databases have done a good job, at least, like, if you have low terabytes

of data Right. They can easily scale. Like, SQL Server, Oracle, or

or proposed, they can easily scale up to low terabytes. And and for most company, I would say,

even though Big Data is a buzzword, most companies don't have

100 of terabytes. So these data was gonna easily handle that. And and even though they are OLTP, they can easily do reporting, aggregation, and all those things. We can easily solve by design.

And, secondly,

there are also, like,

a sweet spot in between where somebody like AWS Aurora

has come up

and has offered some of those features where they offer you distributed compute

and they offer you storage compute, but they also offer you

asset

and they also have all the people workflows. So they're offering kind of best of both world.

And they can scale it to, I don't know, 40, 50, 60 terabytes.

So then why do you even need direct ships of the world and pay the high cost? Right? Because if you have a mixed workload,

then Aurora can easily solve the problem.

Why

go for a BMW when a Honda Civic will do the job? Right?

Because driving a BMW is fun.

But you have to pay for it. Exactly.

Yeah. There's definitely a lot of the kind of I don't know if it's the sort of fear of missing out or just,

you know, cargo culting of, oh, I'm doing analytics. So that means that I automatically need to buy Snowflake

for my, you know, 5 gigabytes worth of data.

Age old problem of crushing a fly with a bazooka. Oh, yeah. By the way, Snowflake has the best business

model where they don't tell you why

they are spinning up clusters, but they they'll have they'll spin up new clusters

all the time for your data load workload.

And the more cluster they spin up, the more money they make. Right? So

I I should have bought this stock.

Why are they launching a new cluster? Oh, because they, you know,

they need to pay their employees stock options or what have you. We should have all bought those stocks. Right? So Yeah. Absolutely.

And so in your experience of

working in the analytics space and managing a data engineering team, what are some of the most interesting or unexpected or challenging lessons that you've learned?

So I think as you

grow in your career and as you grow sort of in

your team organization,

you realize that most of the problems are

most of the technical problems are relatively easier to solve. And maybe I'm giving you little bit nonpolitical answer here.

But

most of the problems that

are kind of hard not to crack, so to speak,

are political problems, organization problems that

require

maybe a lot more weight than you can bear on.

So, for example, in a

decentralized company

like Wayfair, there is no such thing as office of architecture. Right? So nobody's thinking through data architecture.

So all the teams, internally, themselves are doing what they can, but there is not enough weight

that is being put upon on data architecture as a discipline.

So as a result, all the decentralized

data teams are doing

whatever they can. They're doing the best they can, obviously, but but there's no competency. And as as if there's not much gravity there.

So

I feel like

we have missed a step there

by not thinking through that.

But, obviously, if we do that, then we assume the teams down. If you have a office of data architecture, then office is mandating few things.

And that slows people down. Those close team down. And I said, Silicon Valley companies have put lot of premium on Nano

Velocity.

So it's kind of chicken or egg.

I feel like I still feel like

there is

a, maybe,

happy medium there where we still can

have some people thinking about data architecture

while not slim things down. And that would

give people benefit in the long run.

As

you continue to work through the challenges that you're facing and chart the course for your team and your organization

over the coming, you know, months

quarters years.

What are some of

the topic areas that you're most interested to dig into or some of the

sort of, aspects of the data ecosystem that you're keeping a close eye on? Yeah. No. I think there are a lot of first things that are happening, obviously. So

I think

1 thing that we are looking at is how do we break

some of those barriers between machine learning

and, databases.

Right now, we have to

build a separate, I would say, a data structure or a data mod or what have you, semantically

or curated layer for our machine learning or data scientists.

And then

they sort of point the models on top of that.

So

having

your model

point right to where your analytical database lives

and makes that sort of selection within that tool and reduce the barrier

and having all those

algorithms

natively talk to the database.

I think that will be huge.

Secondly, I'm also looking at the OLAP space, which used to be big back in the days, if you remember the

SSCS cube of Microsoft.

And

in the big data world, that technology is I think there are a lot of industry players. I think there is Apache Carlin, which is on Huddl space and atScale, which is

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links