Reflections On Designing A Data Platform From Scratch

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means?

Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves.

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.

Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N. Sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription.

When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster.

With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E. Get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

I'm your host, Tobias Macey. And today, I'm going to be sharing some of the approach that I'm taking while designing a data platform, some of the things I've been thinking about, and some of the lessons that I've learned from the podcast that have fed into those decisions.

So if you are listening to the show, there's a good chance that you're familiar with who I am, but just to give another brief introduction, I'm the host of the podcast. I've been running this for about 5 years now. I've also been running Podcast.__init__, the Python podcast, for about 7 years. And for my day job, I run the platform and DevOps team for the open learning department at MIT.

I got involved in data management through my work as a systems administrator, software engineer, and now a DevOps and platform engineer. In that journey, I've been very interested in the data elements and in being able to build reliable and performant data systems, which is what led me to create this podcast.

And so in

the past couple of years, I've been thinking a lot about how to

architect a data platform

so that it will fulfill the needs of the organization,

some of the specific constraints that I'm dealing with,

and just generally

the pieces that you need to consider when you're starting down that path,

whether you have an existing data platform or if you're starting to build something out from scratch.

And so, in terms of the initial implementation of what we've been using, we started off with installing a system called Redash, which is an open source business intelligence and dashboarding system that allows you to write SQL code in their visual editor and execute it. You can also schedule some reports, and it connects up to a number of different data sources. And so we took advantage of that because it was something that we had available. It was fairly straightforward to get it up and running and connect it up to the different systems, but

it didn't scale very well, particularly as we started to need to do more

complex analysis and we wanted to be able to join across multiple different data sources.

So we've definitely been hitting a lot of limitations of that system.

And I've been thinking a lot about how do we want to

architect a more full featured and robust platform to be able to address all of the data needs of the organization

and

how to

make that maintainable from a platform perspective

and accessible from an end user perspective,

which is a lot of the things that we talk about in this podcast, sometimes at very specific component levels, sometimes more broadly. So I just wanted to share

some of the thoughts that I've had and some of the considerations that I'm going through right now as I begin some of this

architectural planning for implementing

these data infrastructure components and how to stitch them together into an overall platform.

And in terms of being able to actually build the data platform, there are a number of different components that go into it. At a high level, there is data integration, which is the extract and load: being able to pull your data from the source systems into some centralized storage layer for being able to do all of the downstream analytics. Then there is the storage layer itself: are you going to store it in file or object storage and build out some sort of a data lake? Are you going to rely on structuring the data and put it all into a data warehouse?

And then you need a data orchestration layer to manage the data integration, the data transformations, and any downstream uses of the data.

And it's important to have a robust

metadata repository

to maintain

the record of all of the different components of the system,

the data that you have, the way that it's being used,

auditing, access control.

And once you have all of the data in that central storage system, you need to have some sort of a semantic layer, whether that lives in your business intelligence tool, or if you are going to use one of these newer systems that are being called the metrics layer or the semantic layer or headless BI.

And all of this is essentially useless if you don't have some utility of the data, some sort of a data application that's actually going to be used, whether that is serving external end users or building machine learning models off of it. And so these are a lot of the things that I'm thinking about as far as how to stitch together these different pieces and different concerns.

There are a few different philosophies around that, where some folks will say you need to have a fully vertically integrated solution where you have everything living together in one tool so that there's a consistent experience across it. That's definitely something that's very popular, particularly in larger organizations, because of the amount of complexity that already exists. You want to be able to have a way to reduce that complexity.

In the other direction, you have what's being termed the modern data stack, where you have individual best-of-breed components where each different tool will focus on one of these different concerns or maybe have some overlap into a couple of them. So you have this unbundling of the data stack,

but then you also have other organizations

that are working to

repackage that modern data stack, where they will abstract over those different

tools and infrastructure components to be able to give you that consistent experience again, but still be able to leverage some of the innovations and new capabilities that are introduced by these more recent contenders in the space.

And so digging deeper into some of those specific layers, beginning with data integration, which is where all data endeavors begin because you need to have some sort of data to work with: that has typically been extract, transform, and then load, where you need to perform some sort of cleanup or initial modeling before you load it into your data warehouse. With some of the cloud-based capabilities and more recent advancements in data management layers, that has shifted into extract and load, where you will just pull the raw data, load it as is into some of these destination systems, and maybe do some very light transformation.

Some of the things that you need to think about when you're deciding what to use as that data integration layer: What are some of the sources that you're dealing with? Are you pulling from application databases? Are you pulling from third-party SaaS platforms? Are you in control of the data that you have? Are you just pulling flat files off of some sort of file share from a vendor or a partner?

And

do you need to deal with

real time

access to data as it changes? So are you just dealing with periodic batches, whether that's on the frequency

of minutes, hours, days, weeks,

or do you really need to be able to process each event as it occurs and process it with as little latency as possible?

Because each of those capabilities is going to bring in different infrastructure and architectural requirements around the entire rest of the data platform.

Some people will advocate saying that you need to start with streaming because

batch is just a special case of streaming, where you're just doing coarser grained events.

But

it also requires

more upfront investment in terms of the sophistication that you're dealing with. For my own purposes, I'm primarily dealing with application databases and some third-party SaaS platforms, so I'm most likely going to be focusing on a batch approach using something like the Singer specification or something like Airbyte.

And in terms of my approach to selecting the different tools and components, I generally bias towards open source, both because that is an ecosystem that I've spent a lot of time in, and so I feel very comfortable, but also because, as a tinkerer and somebody who works at the platform and infrastructure level and in the reliability space, I really like being able to have access to the internals of the tools and the systems that I'm operating, so that if something does go wrong, I have it in my power to dig in and fix it. That said, there is also a lot of value provided by using a vendor, because they will be the ones responsible for a lot of that reliability engineering, so it alleviates a lot of the burden on your own engineering resources.

So there's definitely a trade off to be made there, and that's something that I think about a lot is, am I biasing too heavily towards open source? Should I be pushing this into a vendor, maybe a vendor that's running open source software so that I can

do some of the debugging and be able to provide more detailed feedback to the vendor in the event that something does go wrong.

So that's something that colors

my overall thinking about the platform layer, is that sort of bias towards open source and running it myself.

And so, in terms of the data integration layer and some of the choices that I'm making, my current thinking is to buy into the Singer ecosystem, most likely leveraging something like Meltano to be able to handle some of the actual stitching together and monitoring and generation and tracking of the metadata related to those executions.

I still need to prove out that

implementation to make sure that it fulfills all the needs that I have, but I like that by using

such an open and flexible specification,

it gives me the option to be able to

invest in the long tail of data sources and destinations,

where, if I'm using a vendor such as Fivetran, I'm a little bit at the mercy of what they have implemented.

And because I'm going to be working with applications that my team owns and modifies, as well as working with some open source components and helping to contribute back to that ecosystem, by being able to build my own customized taps and targets, I can add support for the systems that I need to be able to rely on.
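To make the idea of a custom tap a little more concrete, here is a minimal sketch of the kind of output a Singer tap produces: JSON messages, one per line, of type SCHEMA, RECORD, and STATE. The `emit_stream` helper, the "enrollments" stream, and its fields are hypothetical names for illustration only, and a real tap would normally be built on an existing SDK rather than written by hand like this.

```python
import json
import sys
from typing import Iterable, List, TextIO


def emit_stream(stream: str, schema: dict, key_properties: List[str],
                rows: Iterable[dict], bookmark_field: str,
                out: TextIO = sys.stdout) -> None:
    """Write Singer-style SCHEMA, RECORD, and STATE messages as JSON lines."""
    # A SCHEMA message describes the stream before any records are sent.
    out.write(json.dumps({
        "type": "SCHEMA",
        "stream": stream,
        "schema": schema,
        "key_properties": key_properties,
    }) + "\n")

    last_bookmark = None
    for row in rows:
        # Each RECORD message carries one row for the named stream.
        out.write(json.dumps(
            {"type": "RECORD", "stream": stream, "record": row}) + "\n")
        last_bookmark = row[bookmark_field]

    # A STATE message lets the next run resume from the last bookmark.
    out.write(json.dumps({
        "type": "STATE",
        "value": {"bookmarks": {stream: {bookmark_field: last_bookmark}}},
    }) + "\n")


if __name__ == "__main__":
    # Hypothetical "enrollments" stream standing in for an application table.
    emit_stream(
        stream="enrollments",
        schema={"type": "object", "properties": {
            "id": {"type": "integer"},
            "updated_at": {"type": "string", "format": "date-time"},
        }},
        key_properties=["id"],
        rows=[{"id": 1, "updated_at": "2022-01-01T00:00:00Z"},
              {"id": 2, "updated_at": "2022-01-02T00:00:00Z"}],
        bookmark_field="updated_at",
    )
```

A target reads these lines from standard input and writes the records to a destination, which is what lets taps and targets be mixed and matched.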

That being said, there are some event stream components to some of the systems that I'm running. So at some point, I will need to consider how am I going to incorporate

streaming capabilities.

Currently, I rely on

batching up those events and publishing them as JSON files to S3, so I'll still be able to analyze them, but at higher latencies.
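As a rough sketch of that batching approach, the function below groups events by hour and serializes each group as newline-delimited JSON, keyed by an object path. The `batch_events` name, the `timestamp` field, and the key layout are all assumptions made for illustration. They are not a description of the actual system, and the upload step is left out entirely.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, List


def batch_events(events: List[dict], prefix: str = "events") -> Dict[str, str]:
    """Group events by hour and serialize each group as newline-delimited JSON.

    Returns a mapping of object key to file body. The key layout is a
    hypothetical convention, not something prescribed by S3.
    """
    groups = defaultdict(list)
    for event in events:
        # Assumes each event carries a Unix epoch "timestamp" in seconds.
        ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        key = f"{prefix}/{ts:%Y/%m/%d/%H}.jsonl"
        groups[key].append(event)
    return {
        key: "\n".join(json.dumps(e, sort_keys=True) for e in batch) + "\n"
        for key, batch in groups.items()
    }
```

Each body could then be uploaded under its key with whatever S3 client is already in use. The size of the time window is what sets the floor on latency for downstream analysis.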

So

it's a question of,

will those latency requirements ratchet down and introduce the need sooner rather than later to invest in a streaming

environment.

My personal preference there

is most likely

to

use something like Pulsar because of its flexibility

in terms of operational

characteristics, as well as

the interfaces that it provides.

So I like the fact that it has compatibility

offerings for Kafka because of the size of that ecosystem,

but it also supports the PubSub

approach to messaging, such as what RabbitMQ provides with AMQP. So it's just a very flexible

ecosystem.

But that's something that I haven't

explored fully yet, but that's my current thinking on the matter.

And I've just heard a lot of reports of people running into challenges dealing with Kafka, particularly as it scales, even when using some of these managed solutions, just because of some of the foundational architectural elements about how consumers are managed and how that maps to topics, and how you need to do a lot more upfront planning and consideration in terms of how you design your topics and event streams. So having that greater flexibility in terms of being able to iterate and discover and modify as you go is appealing, versus having to do so much upfront investment in being able to design the system.

Moving into the data storage layer, there are some additional considerations to be made. How is the data going to be used? Is it primarily going to be textual data? Is it something that can be easily structured into a table format? Or are you also going to be dealing with unstructured data sources such as images, videos, or binary object data such as PDFs, or are you dealing with genomics or customized geospatial information?

Personally, I'm mostly dealing in the space of structured and relational data, maybe some semi structured JSON objects.

But there is still a lot of discovery to be done in terms of some of the unknown

data sources that I need to interact with as I start

to scale out the platform. And I also like the flexibility of being able to store things in a file layer because it gives me the option, once I have the data in a loosely structured data lake, to invest in modeling and optimizing some of the access around that data by loading it into a data warehouse or an OLAP store as a secondary concern, without having to make that investment upfront and constrain some of the downstream capabilities that I have. So I'm optimizing for flexibility at the cost of a little bit more complexity in the overall stack that I'm going to be operating.

TimescaleDB,

from your friends at Timescale, is the leading open source relational database with support for time series data.

Time series data is time stamped so you can measure how a system is changing.

Time series data is relentless and requires a database like TimescaleDB with speed and petabyte scale.

Understand the past, monitor the present, and predict the future. That's Timescale.

Visit them today at dataengineeringpodcast.com/timescale.

Some of the other things to consider are: What are some of the other tools and systems that you're going to need to integrate with? Are you planning on using a data quality vendor that only works with some of the main data warehouses? Are you going to be relying on SQL as the primary access mode for managing your data? Are some of the reverse ETL vendors that you're going to be relying on only available with some of these data warehouses? So

there are some

costs and considerations to be made when you're deciding

whether to

use a data lake or a data warehouse approach,

particularly as the data warehouse becomes the focal point of the

modern data stack.

And so, there are definitely a lot of benefits to be had by using a data warehouse from one of the big vendors, such as BigQuery, Snowflake, or Redshift, because so much of the ecosystem has invested in working well with those systems by being able to do things like analyzing the query logs to auto-generate lineage information, or introspecting the table schemas to load them into your data catalog. So there are a lot of companies that are investing in that ecosystem.

That being said, there has been also a movement towards

standardizing

on what the

lakehouse architecture will look like, where

you have your data stored in a file or object storage, and you're building out the storage layer as a data lake, but then you also are able to take advantage of some of the warehouse semantics and access patterns through one of these SQL interfaces to object storage, such as Presto or Dremio or Trino.

And so given my choice to use a data lake storage approach, where I'm using S3 as the storage location for my files, I'm planning on using Presto or Trino as the SQL interface so that I have that data warehouse interface where I can treat everything through SQL and be able to lean on tools such as dbt and some of the other very SQL-optimized workflows, but still have the flexibility of access to work with those file objects through other systems, whether that's just pure Python or Dask or Spark.

And the other aspect of the storage story when working with data lakes is that you need to think about what is the actual format of that data. So am I just storing it as newline-delimited JSON? Am I storing it as binary blobs? For any relational data, I'm focused on using Parquet because it's a very well-defined and well-supported format that gives you some of the advantages of columnar data stores for being able to do aggregate analytics. So it makes it much more performant when you're working with these Presto or Trino or Dremio systems.
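The advantage of a columnar layout for aggregates can be illustrated with a small pure-Python sketch. This is only a toy model of the layout idea and not how Parquet itself is implemented, since Parquet adds encoding, compression, and row groups on top. The `to_columns` and `column_mean` helpers and the sample records are hypothetical.

```python
from typing import Dict, List


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot row-oriented records into one list per column.

    Assumes every row has the same keys, as a relational table would.
    """
    columns: Dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_mean(columns: Dict[str, list], name: str) -> float:
    """Aggregate a single column without touching any of the others."""
    values = columns[name]
    return sum(values) / len(values)


# Hypothetical course-score records used only for illustration.
rows = [
    {"course": "physics", "learner": "a", "score": 90},
    {"course": "physics", "learner": "b", "score": 70},
    {"course": "biology", "learner": "c", "score": 80},
]
columns = to_columns(rows)
print(column_mean(columns, "score"))  # 80.0
```

Once the data is stored by column, an aggregate over `score` only has to read that one list, whereas a row-oriented file forces a scan of every field of every record.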

And in order to be able to

stitch all of these things together, you need to have some data orchestration layer, where

it will handle the

mapping of what are the dependencies

across these different either tasks or datasets,

and

what are the periodic

schedules that I need to be able to manage, particularly when you're working with these batch systems.

And

the data orchestration piece

is becoming

one of the most important

choices that you make because it becomes the

control center of your entire data platform,

and it is where

a lot of the business logic will live as far as

what are the mappings of sources to destinations,

what are the transformations that need to be applied and when. It will have a lot of the information about the lineage

of your data as it goes from these sources to destinations.

The question of whether

the primary

consideration is task oriented or data oriented will influence a lot of the capabilities that you have.

So a lot of the earlier generations of workflow orchestration systems were very focused on task sequencing, where you would say, I need to do this task, where it is agnostic to the actual data that it's working with or the execution, and then I need to do this other task. It will handle that dependency graph to ensure that these tasks are run in sequence, but without adding in additional logic and additional systems, you're not going to have any concrete guarantees about whether the data that you care about is actually going to be processed properly or whether there is consistency between what the outputs of one stage are and the expected inputs of the next stage. So you need to, again, add in a lot of extra logic and upfront work to make sure that those different task stages are compatible with each other.
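A task-oriented scheduler in this sense can be sketched in a few lines with the standard library's `graphlib`. The `run_tasks` helper and the three task names are hypothetical, and this is not how any particular orchestrator is implemented. The point is only that the scheduler sees names and ordering, and nothing about the data flowing between tasks.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_tasks(dependencies: Dict[str, Set[str]],
              tasks: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every task after its upstream tasks and return the order used."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


log: List[str] = []
# Hypothetical three-step pipeline; each task is just a function call.
tasks = {
    "extract": lambda: log.append("extract"),
    "load": lambda: log.append("load"),
    "transform": lambda: log.append("transform"),
}
# Each key maps to the set of tasks that must finish before it starts.
dependencies = {"load": {"extract"}, "transform": {"load"}}
print(run_tasks(dependencies, tasks))  # ['extract', 'load', 'transform']
```

Nothing here would notice if "load" wrote a file that "transform" could not read, which is the gap the extra logic has to fill.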

And some of the more recent generation

of systems

are focused on this very

data oriented approach, or sometimes data asset oriented approach.

So you are able to encode in the logic of the task flows what are the types of data that I'm working with, both as inputs and outputs for each of these stages. And then it will be able to give you some early warnings when you're developing your flows if the output of one task is not going to be compatible with the input of another task. And it will also give you insight as to when different data assets are created or modified, where an asset could be a CSV file in S3, it could be a table in a data warehouse, it could be a dashboard in a business intelligence system. So being able to get that native view of the actual assets that you care about at the end of the day versus just whether tasks executed on time is very valuable.
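The early-warning behavior described above can be sketched as a compatibility check over declared inputs and outputs. This is a simplified illustration and not the API of Dagster or any other orchestrator. The `Step` class, the `check_pipeline` function, and the type names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    name: str
    consumes: Optional[str]  # name of the data type expected as input
    produces: str            # name of the data type emitted as output


def check_pipeline(steps: List[Step]) -> List[str]:
    """Return a warning for every adjacent pair whose types do not line up."""
    warnings = []
    for upstream, downstream in zip(steps, steps[1:]):
        if downstream.consumes != upstream.produces:
            warnings.append(
                f"{upstream.name} produces {upstream.produces!r} but "
                f"{downstream.name} expects {downstream.consumes!r}"
            )
    return warnings


# Hypothetical pipeline with a deliberate mismatch in the last step.
pipeline = [
    Step("extract", None, "raw_csv"),
    Step("clean", "raw_csv", "parquet_table"),
    Step("report", "dashboard_rows", "dashboard"),
]
for warning in check_pipeline(pipeline):
    print(warning)
```

Because the check runs over declarations rather than real data, the mismatch between "clean" and "report" shows up while the flow is being developed instead of after a failed run.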

As I'm sure folks who listen to this podcast are aware, I

have invested in the Dagster ecosystem

because I think that they do a very good job of being able to

bring that

data native

aspect to the fore, but there are definitely a number of other great systems out there. So I definitely encourage everyone to go and evaluate those different tools to see how well they suit your own use cases.

Another thing to think about when you're choosing a data orchestrator is who is going to be managing the workflow logic. There are a number of different languages and different approaches, where some of them are very code and software engineering heavy, such as Dagster.

Others

are optimized for

people who understand the business domain but don't want to dig into the code, where they have visual modeling of being able to build your different flows.

So I know that Prophecy is a system that's focused on that, where they're building this

low code or no code

interface to being able to build your data orchestration, and then it executes on Spark.

Other things to think about are what programming languages your team is familiar with. Dagster, Prefect, and Airflow are all very focused on the Python language and ecosystem, which has grown to become one of the main languages used in data management and data processing.

But there is also still a heavy

component of

Java developers and Scala developers because of systems such as Spark and Hadoop.

And so those are things to think about when you're deciding on what your data orchestration layer is going to be. Also, who is going to be managing the orchestration layer? Is it something that you and your team have the capacity to be able to deploy and maintain?

Or do you want to go with some

vendor managed or hosted solution for those systems?

And then also, as with everything, what is the broader ecosystem

around that tool?

Is there going to be

general

support

and

accrued industry knowledge about how to do different things with that tool? So Airflow has definitely become

one of the main contenders there, where

it is a very popular tool and has been around for a while, and so there are a lot of people who understand

how to run it, what are some of the quirks, how do you deal with some of the different challenges that might come up, what are some of the useful design patterns?

Whereas some of the newer contenders are still going through that process of figuring out what are the best practices around how to deal with this system, how to work with it as you scale, both in terms of

data and execution, but also in terms of organizational and logical complexities.

Another piece of the data platform that's very important and has been gaining a lot of attention

recently

is the metadata layer. So

that can take the form of data catalogs, it can take the form of

lineage

graphs.

Different

tools and systems will have their own

concepts of representation of metadata,

but it's very important to think about how are you going to manage the metadata of your overall platform to be able to get a holistic view

of how

data is being

used, what are the different data sources and destinations and assets that you have.

And there are also different types of metadata, where there's the metadata that is the table schemas and column definitions for information in your data warehouse, but then there's also metadata about who accessed the system at what time, what queries were executed, what steps were taken to get this record from the application database that it was created in to this dashboard, what manipulations were performed on it, and how long all of those things took. And so it's definitely very important to consider

what are the uses of metadata that you are going to be relying on and that you need to be able to surface

to end users and to operators of the platform?

And what are the different ways that you are able to

extract metadata from the different pieces that are working

with this information that you care about? And how easily can you

transmit that into a centralized metadata repository.
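One of the questions a centralized metadata repository has to answer is what steps a record went through to reach a dashboard. A minimal sketch of that lookup, assuming lineage is stored as simple source-to-destination edges, is shown below. The `upstream_of` function and the asset names are hypothetical, and real systems record far richer events than this.

```python
from collections import deque
from typing import Dict, List, Tuple


def upstream_of(asset: str, edges: List[Tuple[str, str]]) -> List[str]:
    """Return every transitive upstream asset, nearest first.

    Each edge is a (source, destination) pair recorded by some pipeline run.
    """
    parents: Dict[str, List[str]] = {}
    for source, destination in edges:
        parents.setdefault(destination, []).append(source)

    seen = {asset}
    queue = deque([asset])
    order: List[str] = []
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical lineage edges from an application database to a dashboard.
edges = [
    ("app_db.users", "lake/users.parquet"),
    ("lake/users.parquet", "warehouse.dim_users"),
    ("warehouse.dim_users", "dashboard.signups"),
]
print(upstream_of("dashboard.signups", edges))
```

The same edge list can be walked in the other direction to find which dashboards are affected when a source table breaks.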

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and to build an identity graph on your warehouse, giving you full visibility and control.

Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool.

Sign up for free or just get the free T-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.

There are a lot of different tools out there right now. Some of the main contenders from a very holistic, metadata-writ-large perspective are DataHub and OpenMetadata. There is also the OpenLineage project, with Marquez as the reference implementation, for being able to track lineage specifically and being able to understand what tasks and operations have been performed on data over its lifecycle.

And there are a whole host of data catalog options out there for being able to track your data assets,

what are the access patterns of the data,

and being able to understand what is the overall popularity of a particular set of data resources so that you know what to invest in,

what to

focus on when there is any sort of a data outage, things like that.

So those are

what I consider to be the main

foundational elements to a data platform.

Other things that you really need to be thinking about, too, are data quality monitoring,

what a lot of folks are calling data observability, to be able to know

when does something go wrong in your data platform,

how are you going to identify

where it went wrong, how it went wrong, how to resolve it. I haven't focused yet

on

that specific stage of it because I'm still in the early planning phases and initial implementation of some of these more core aspects.

But that's definitely something that I've been thinking about as I start to plan this out.

I'm most likely going to be investing in using something like Great Expectations because it is an open source tool, it's very flexible, and it gives me a way to start writing the contracts that I know that I care about. But then there are also a lot of things that you need to be able to monitor and be alerted on that are unknown unknowns, and that's where systems such as Anomalo or Bigeye or Datafold come in to be able to help with some of that automated detection of sources of errors that you might not know that you need to check for until it becomes an issue.
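The idea of writing down the contracts you already know you care about can be sketched in plain Python. This deliberately does not use the Great Expectations API. The `check_contract` function, the column names, and the rules are hypothetical, and a real tool adds profiling, reporting, and storage of results.

```python
from typing import Callable, Dict, List, Tuple


def check_contract(rows: List[dict],
                   contract: Dict[str, Callable[[object], bool]]
                   ) -> List[Tuple[int, str]]:
    """Return (row index, column) for every value that breaks its rule."""
    violations = []
    for index, row in enumerate(rows):
        for column, predicate in contract.items():
            # A missing column counts as a violation, as does a failed rule.
            if column not in row or not predicate(row[column]):
                violations.append((index, column))
    return violations


# Hypothetical contract: ids must be present, scores must be 0 to 100.
contract = {
    "id": lambda value: value is not None,
    "score": lambda value: 0 <= value <= 100,
}
rows = [
    {"id": 1, "score": 50},
    {"id": None, "score": 120},
]
print(check_contract(rows, contract))  # [(1, 'id'), (1, 'score')]
```

Checks like these only cover the known cases, which is why the anomaly detection systems are a separate category.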

Once you have all of your data

in this storage layer, you have the orchestration, you have metadata about it, you also need to think about,

for reporting purposes

or

machine learning purposes, what are the actual

core

data models that I care about? How am I going to do what some might call master data management or semantic modeling?

How am I going to make sure that the definitions that I have about some of these entities in my system are shared across all of the different downstream consumers, which is where

the semantic layer comes in.

Until recently, a lot of that semantic modeling happened directly in business intelligence tools, but there has been a recent shift towards headless BI or the metrics layer.
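The core idea of a metrics layer is that a metric is defined once and every consumer gets the same SQL for it. A minimal sketch of that idea follows. The `Metric` class, `compile_metric`, and the table and column names are hypothetical, and the string formatting shown here is for illustration only and would not be safe for untrusted input.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Metric:
    name: str
    table: str
    expression: str  # SQL aggregate expression that defines the metric


def compile_metric(metric: Metric, group_by: List[str]) -> str:
    """Render one shared metric definition into a SQL query."""
    dimensions = ", ".join(group_by)
    select = f"{dimensions}, " if dimensions else ""
    sql = (f"SELECT {select}{metric.expression} AS {metric.name} "
           f"FROM {metric.table}")
    if dimensions:
        sql += f" GROUP BY {dimensions}"
    return sql


# Hypothetical metric shared by dashboards, notebooks, and data APIs.
active_learners = Metric(
    name="active_learners",
    table="enrollments",
    expression="COUNT(DISTINCT learner_id)",
)
print(compile_metric(active_learners, ["course_id"]))
```

Whether the query is requested by a dashboard or an API, the definition of "active learners" lives in one place instead of being copied into each tool.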

And there are a number of different tools out there from the open source ecosystem. Some of the ones that I'm considering are Metriql or Cube.js because they're both very flexible in terms of being able to use them to power data APIs and write additional logic on top of those systems. But I'm still very early in those phases. I haven't invested a lot of time and energy in deciding on that piece of the stack, and I won't until I have all my data in a centralized repository and have started doing some exploration about what modeling considerations I need to have and what entities I care about. Once that semantic layer is in place, then you also have different data applications that you can start to build. But those are

towards the higher end of the sort of data hierarchy of needs, where before you can think about

what are the exact

data applications

that I'm going to build, I need to know what is the data that I have to know

what can I build with it?

And so those are a lot of the core considerations

and thoughts that I have around the data architecture

elements of it.

And then as far as implementation, I'm largely starting greenfield, and there are a number of unknown unknowns as we start to explore the data and scale the capacity of the platform.

And so, in order to keep from getting lost in a sea of decisions and ending up building an incredibly sophisticated platform that nobody actually cares about and that doesn't provide any value, I'm starting with a specific end user need for data that I know that I have,

because that will give me a way to

start with the implementation,

explore a lot of these unknown unknowns,

and start to discover what are the edge cases, what are the

problems with some of the choices that I made early on that might become

exacerbated as we scale

out. And so

by selecting a single

focused

end user need with this data saying, Okay, this is the actual

initial

data application that I need to provide.

That gives me a way to say, okay, these are the data sources that I need to collect.

I don't need to collect everything from everywhere. I just need to collect

these pieces of information

into a centralized location, figure out the modeling and

semantics

of that data so that I can then provide it to an end user,

validate whether my choices from an architecture and infrastructure perspective still hold, and see how well I am able to provide a self-service interface to a data analyst or a data scientist to be able to work with that data.

And then based on that feedback that I get from the end users and from myself and my team and

just the overall experience of working through that problem, I can then say, okay,

these are the pieces that did work well, I'm definitely going to stick with them and invest more in them, or these are the pieces that didn't work well, and I actually had to

replace it with this other piece or add in an additional

component that I hadn't considered upfront.

And so that gives me a way to

mitigate some of the potential risks

fairly early by doing an initial

research spike of

not scaling to production capacity, but just doing a very narrowly scoped, you know, how does this work from end to end? And then what are some of the

limitations in terms of actual integrations across these different tools that I didn't know until I really started to

get into the guts of it and work with it and

use that for something that an end user is going to rely on.

And

so, there's a lot of stuff that I've learned from this

podcast, as far as

the considerations to be made, the availability of different tools, some of the ways that very

sophisticated

organizations

are working with data.

But the other thing to consider as you're going through your own journey is,

what are your own specific constraints? What are your own specific needs and

capabilities as it pertains to these different tools? And so

it's valuable to not just buy into

whatever a vendor is pitching, but think about

how is it going to fit with the rest of my

infrastructure,

with my team, with my own experiences.

And so,

I definitely appreciate everybody listening.

This is the second time I've done a monologue for the podcast.

It's an interesting

way for me to try to organize my thoughts on the matter. So I appreciate any feedback that you have as to

whether you found this valuable, if there are

any additional questions that you have about my own thinking and experiences, if you have any thoughts that you would like to share that

maybe you think would be worthy of an episode to discuss from your own experiences.

Definitely

grateful for everybody who listens to this podcast

every week.

And so

for anybody who does want to get in touch, provide feedback, follow-up, make suggestions, I'll add my contact information to the show notes.

And in terms of the final question of what I see as being the biggest gap in the tooling or technology that's available for data management today: in the last monologue, I focused on the lack of first-class support in some of these application frameworks for providing data extraction and integration capabilities out of the box. This time, given the focus on architecting a data platform, I'd say that one of the biggest gaps is in widely available and popularized information about how to actually navigate this expanding landscape of the data ecosystem.

There are definitely a lot of valuable resources out there

and a lot of personal experiences

of building some of these systems, but

it's definitely still difficult to be able to say, okay, these are my specific circumstances.

What is my best option for being able to build a data stack? But there are definitely ways that you can do that. There are things like AWS Lake Formation. Google has their own opinionated approaches. Different vendors will have their own end-to-end solutions. Databricks has Delta Lake.

But it's always valuable to think about how does this fit with my own understanding

and my own requirements. And so I still think that there is a bit of a gap in more

vendor

and platform neutral

ways to think about your end to end approach of data infrastructure.

So I definitely appreciate all of you listening. I'm glad I was able to share some of my thinking on the matter. Please provide feedback if you have found it helpful. Thank you, and have a good rest of your day.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__, at pythonpodcast.com to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
