Summary
As software lifecycles move faster, the database needs to be able to keep up. Practices such as version-controlled migration scripts and iterative schema evolution provide the mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities early in the adoption of modern development practices and co-authored a book codifying a large number of patterns to aid practitioners. In this episode he reflects on the current state of affairs and how things have changed over the past 12 years.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers
- Your host is Tobias Macey and today I’m interviewing Pramod Sadalage about refactoring databases and integrating database design into an iterative development workflow
Interview
- Introduction
- How did you get involved in the area of data management?
- You first co-authored Refactoring Databases in 2006. What was the state of software and database system development at the time and why did you find it necessary to write a book on this subject?
- What are the characteristics of a database that make them more difficult to manage in an iterative context?
- How does the practice of refactoring in the context of a database compare to that of software?
- How has the prevalence of data abstractions such as ORMs or ODMs impacted the practice of schema design and evolution?
- Is there a difference in strategy when refactoring the data layer of a system when using a non-relational storage system?
- How has the DevOps movement and the increased focus on automation affected the state of the art in database versioning and evolution?
- What have you found to be the most problematic aspects of databases when trying to evolve the functionality of a system?
- Looking back over the past 12 years, what has changed in the areas of database design and evolution?
- How has the landscape of tooling for managing and applying database versioning changed since you first wrote Refactoring Databases?
- What do you see as the biggest challenges facing us over the next few years?
Contact Info
- Website
- pramodsadalage on GitHub
- @pramodsadalage on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Database Refactoring
- Thoughtworks
- Martin Fowler
- Agile Software Development
- XP (Extreme Programming)
- Continuous Integration
- Test First Development
- DDL (Data Definition Language)
- DML (Data Manipulation Language)
- DevOps
- Flyway
- Liquibase
- DBMaintain
- Hibernate
- SQLAlchemy
- ORM (Object Relational Mapper)
- ODM (Object Document Mapper)
- NoSQL
- Document Database
- MongoDB
- OrientDB
- Couchbase
- Cassandra
- Neo4j
- ArangoDB
- Unit Testing
- Integration Testing
- OLAP (On-Line Analytical Processing)
- OLTP (On-Line Transaction Processing)
- Data Warehouse
- Docker
- QA (Quality Assurance)
- HIPAA (Health Insurance Portability and Accountability Act)
- PCI DSS (Payment Card Industry Data Security Standard)
- Polyglot Persistence
- TopLink (Java ORM)
- Ruby on Rails
- ActiveRecord Gem
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. Your host is Tobias Macey, and today I'm interviewing Pramod Sadalage about refactoring databases and integrating database design into an iterative development workflow. Pramod, could you start by introducing yourself?
[00:00:54] Pramod Sadalage:
Yeah, hi. I'm Pramod Sadalage. I've been with Thoughtworks for almost 19 years, and a couple of years before that, mostly in the data space. Almost 20 years ago, I got into the data space because I was optimizing an application to perform better, and that's how I got interested in data, databases, and how you can improve their performance. When I first joined Thoughtworks in 1999, we were doing Agile for the first time: an application that was being done in waterfall, we shifted to doing Agile, and Martin Fowler and Ward Cunningham were our coaches.
We had to take this big application that was a big up-front-design application and move it to an iterative-design kind of application. And that's where, one fine day, we went from 600-plus tables that were designed down to 9 tables, and from there built it back up iteratively. That's how I got interested in this whole notion of building a database in an evolutionary fashion and all the other factors that influence that decision as well as the design. Over the years, I have interacted with many clients and many teams, learned from all the teams I've interacted with, and dabbled in NoSQL databases and how they affect this. But the basic thing I am really interested in is the iterative design of databases and all the surrounding aspects of that as it affects the team, deployment, design, the productivity of the team, and related stuff.
[00:02:35] Tobias Macey:
And just a clarifying point before we go further: I know that you had a coauthor on the book Refactoring Databases, but I wasn't sure what the relationship was there, whether you were the primary author and he was there as a sort of editor.
[00:02:48] Pramod Sadalage:
No, we were both equally coauthors. Scott Ambler was also talking about evolving databases and doing evolutionary design, and Martin basically connected the two of us so that we could put our thoughts together and write.
[00:03:04] Tobias Macey:
And so you first coauthored the book Refactoring Databases back in 2006. I'm wondering if you can describe the state of software and database system development at that time and why you found it necessary to write a book on that particular subject.
[00:03:21] Pramod Sadalage:
Yeah. So like I said in the intro, that first project had maybe 30-plus developers, and with analysts, testers, and PMs combined it was a team of about 60 people. We were on this journey to build a leasing application, and that included the back end as well as the front end. As we moved from a big up-front design to an iterative way of working, what we found is, first of all, the developers need a lot more freedom in how design happens on the data side. They also need to be more proactive in how they can get a database to work with, get test data set up, and get much better feedback on what kind of design is working or not working, and stuff like that. I was the primary DBA slash data architect or data modeler on that project, and that's how I got into this space. What we realized is we need a much more collaborative environment. I can't expect someone to send me an email or have a meeting with me about something they're working on; that's not going to work, because then I'm blocking their work. Right? The card is blocked on the wall. The other part was, if someone wants something and I can do it on a command line, I can actually build a tool and give it out to developers, and they can do it themselves. So automation was a big thing. The continuous integration practices that we were trying to put in meant that whatever the developers did had to be available in the continuous integration environment, a totally separate environment. Then how can I make the changes that were done by the developers also available in the CI environment? So there's a bunch of version controlling of the database; those practices came up. What we discovered over a bunch of discussions with other team members as well as Martin is that these are practices that are relevant to everyone in the industry that's trying to adopt this agile, iterative, or evolutionary way of working.
And at that time, agile was a very new word. XP was very new for everyone. And then we started going out and talking to everyone, all of our team members. If I recollect correctly, it was myself, Peter Shu, Aima Dalshami, and a bunch of others who were well versed in this. We talked about this process, talked about how to do these things, and also pushed the envelope a little bit in terms of thinking about how we could do these things differently. And after maybe two or three years of talking and writing articles and blog posts and things like that, Martin approached myself as well as Scott Ambler to see if we could put this thinking into a patterns kind of book. So the book came about because there are certain things about process that you need to follow, and there are patterns of refactoring that need a way to be communicated amongst the team itself, amongst people. If you go back to the code refactoring book, or to the design patterns book, the whole notion was: I can cite a certain pattern, and people understand what I mean. People understand what it entails.
People understand the trade-offs I'm making to implement that particular design pattern. Similarly, we took the approach that there are certain patterns of refactoring your database: let's codify them, give a name to them, and spell out the trade-offs when you do that particular kind of refactoring. Now, to do any kind of refactoring, whether code refactoring or database refactoring, you need to follow certain processes. CI is an example of that, or test-driven development, or collaboration, how you collaborate with the rest of the team. So the book is basically divided into two separate parts. The first part is more about process: how do you make this work in a team environment? What is more important? How do you interact? How do you collaborate? And the second part is basically a series of refactorings. I think there are about 70 or so, split up into 5 or 6 categories. And the idea was, once this is out, once we can talk to each other in a particular pattern-oriented language, it's much easier to understand.
And later on, maybe 2008, 2009, a bunch of tools came about that implemented that language itself. Liquibase is an example of that: it gives you a refactoring name and does all the things you would need to do underneath for you, in a DSL kind of approach.
[00:08:33] Tobias Macey:
And I know that some aspects of this are going to be a little obvious, but what are the characteristics of a database in particular that make them more difficult to manage in an iterative context than source code?
[00:08:46] Pramod Sadalage:
Yeah, I think state is a big thing. In code refactoring, for example, I can refactor a method and all the things that are calling that method. If the API of the method doesn't change, then there's nothing much to do: I compile, make sure all the tests run, add more tests if needed, and it can just go along on its own. But if you are giving out an API to the public, then you need to be very careful when you're refactoring that API. You cannot just change the attributes on it and stuff like that, because the callers that are used to using it in a certain way also need to change, and you cannot leave them with a broken API. Take that same philosophy to a database refactoring: because data has state in the database, a customer's name still has to be the customer's name when the refactoring is done.
Or if the customer has two outstanding invoices, then after the refactoring is done, you still need to have two outstanding invoices with the correct amounts. So that state is very important. That's one part. The other is the dependencies. In an enterprise, or in any kind of database that's more than, I would say, a week old, there are a bunch of dependencies on the database. People start using the data in the database without the application knowing about it. People start calculating stuff off of the database; reports, BI tools, and ETL tools start hooking into the database and pulling data out. And if you refactor the database underneath them, all those dependencies are out there with wrong assumptions, or broken, or not working, or giving wrong results. All kinds of dependencies break. That, I think, makes it very critical when you refactor a database that you keep the legacy API of it. I tend to think of the database schema as an API to the data itself. So if you are changing the API, then you need to provide certain ways for people to get to it without having to change their application. If I own the database, I go ahead and make the change, but there are other applications that are still reading from the database. If they can keep reading the way they were reading before, then I'm fine; I can go change my database and not worry too much about others being affected. A very good example of this: let's say previously I had customers, and I called the table customer.
And now, because of a change in business or a change in my business domain, I am calling them clients. So I go ahead and rename the table from customer to client, a very simple refactoring. But if other applications and ETL tools are referring to the table as customer, now they won't work anymore, because I've renamed the table from customer to client. So in these refactoring patterns that we talked about, we give out examples of how you can provide support for legacy. To take that same example again, I can rename the customer table to client, and I can put a synonym called customer on the client table. So if someone comes in and looks for the customer table, the database will point them to client automatically, like an internal API mapping in some ways, and they still get the data. One other technique is to put a view on top of the client table, call it customer, and people who come looking for the customer table can still get the data. By doing these kinds of techniques, you have now made a refactoring, changed the design of the database or whatever you wanted to change, but all the dependents are working fine; they are not broken. And over a period of time, you can collaborate with those external team members, saying the customer table has now been renamed to client.
So over a period of time, as you are pushing new code into production on your side, please refer to the new table, and I may delete this customer interface in a month, or in six months, or whatever your deployment timelines are. So it gives a safe way to get to a new place without having to break things for others.
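To make that concrete, here is a minimal sketch of the rename refactoring with a compatibility view, using SQLite from Python's standard library purely for illustration; on a database like Oracle you could create a synonym instead. The table and column names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer (name) VALUES ('Acme Corp')")

# The refactoring: rename the table to match the new domain language...
conn.execute("ALTER TABLE customer RENAME TO client")
# ...and leave a view behind so legacy readers that still say "customer" keep working.
conn.execute("CREATE VIEW customer AS SELECT id, name FROM client")

# An old report query, untouched, still returns the same rows.
print(conn.execute("SELECT name FROM customer").fetchall())  # [('Acme Corp',)]
```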
[00:13:21] Tobias Macey:
Yeah, I think that part right there is one of the major takeaways: because databases are used beyond just the application that they're created for, particularly with ETL pipelines, evolving the schema can break other people's workflows. And so one of the questions I have coming out of that is how much involvement should be expected of the data engineers or the database administrators in the development workflow, to ensure that the changes that developers are making maintain those APIs, don't create unnecessary breakage, and evolve the schema in a way that is maintainable within the context of the other systems that depend on the existing schema?
[00:14:08] Pramod Sadalage:
That, I think, is a great point that you bring up. Traditionally, we have seen the data people as the ones who take the output of the developers and then make it work. The developers change stuff, and then the DBAs look at it, implement it, and maybe come back and say, oh, you need an index here, and that kind of stuff. Way back from 1999, we flipped that around: instead of developers coming to me with their output, it's the person like the DBA, or me, going and talking to the developers when they want to make a change. You are basically getting ahead of the change and collaborating with the developers while the change is happening. So that's one big change in the way of working: you collaborate earlier in the cycle, so that you can give feedback as well as provide techniques on how to make the schema migration work.
And when you are doing this feedback and collaboration, you can give all the feedback you want, all the techniques you want, so that the older schema doesn't break. Developers know a lot about the change that needs to be done, because they are getting the story or the task to work on in an iterative way. But the DBA is also involved in the production side of the shop, and they know the size of the data. They know where else this may be used. They also know all the optimizations that they have done, like indexes or disk striping or whatever else happens on the operational side. So it's immensely valuable when the DBA sits with the developers and pairs on these changes. A typical workflow on a story, which we developed and have refined since then, is that when a story or task is taken on for development by a developer pair, if it involves any kind of persistence change, they basically come talk to me or the data person. Or I go sit with them, or they come sit with me; it doesn't really matter. But we talk through the options of what needs to be done, actually write that data migration itself, and then check it in. Once it's checked in, it's available for everyone.
There's no code being changed by the developer yet to access that new change, but the change is already done on the data side. That gives us a way to make sure that all old code still works. All the tests still work. All the persistence still works. And the developer can now go make the change that is necessary to access the new stuff. So it gives you a very easy way to transition from the old to the new side of things. This collaboration really helps, person-to-person collaboration; we have done this sitting side by side at the same table, or using online tools where we collaborate with each other. So that's the first part. The second part is every change you make has to be checked in, has to be version controlled. And when you make a code change, for example, if something's wrong, you can make a change again in the same piece of code. That is not how database change works, because once you make a change, that change has already been applied to the database.
And you can't edit that script so that it can rerun again. That won't work, because the table name has already changed. If you say, oh, instead of renaming customer to client, let me rename customer to party, well, the customer table is already called client, so there's no customer to rename. So there's a sequence of operations that needs to be followed: whenever you are making a change, it has to be a new change, not a modification of an existing change. It's a little confusing to understand in the beginning, but think of it this way: when you change the state, the state has already changed, so you can't rely on the old state; you have to rely on the new state. So every change you want to make is a new change, a new delta.
And that delta has to be self-sufficient, complete in itself. An example of that: let's say you are adding a new column to a table, and that column needs to have some data based on other columns in the table or based on other tables in the database. When you do that, your change should include adding the column and the DML script to put data into that column. That's a complete change. You can't just add a column and then say, oh, let's not worry about the data right now; that's not a complete change. We push a lot for that complete change, and the reason is that when you push for a complete change, you can deploy your application anytime you want, because in a CI/CD pipeline, your application should be deployable at any point in time. If you don't have the DDL as well as the DML in the same change, your application is not in a state that can be deployed at any time. So those two or three practices we push a lot: end-to-end collaboration, CI/CD, and the version control aspects. These things also help a lot on the DevOps side, which is a newer movement, but we have been doing this so that the application is in a deployable state at any given time.
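As a rough sketch of what one such "complete change" can look like, here is a self-contained delta, again using SQLite for illustration and a hypothetical invoice table: the DDL that adds the column and the DML that backfills it travel together as one immutable migration, and any follow-up fix would be a new delta rather than an edit to this one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, net REAL, tax REAL)")
conn.execute("INSERT INTO invoice (net, tax) VALUES (100.0, 8.0), (50.0, 4.0)")

def migration_0042_add_invoice_total(conn):
    """Delta #42: never edited once applied; a correction becomes delta #43."""
    conn.execute("ALTER TABLE invoice ADD COLUMN total REAL")  # the DDL
    conn.execute("UPDATE invoice SET total = net + tax")       # the DML backfill

migration_0042_add_invoice_total(conn)
print(conn.execute("SELECT id, total FROM invoice").fetchall())
```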
[00:19:48] Tobias Macey:
So you mentioned a few times the concept of versioning the database. I'm wondering if you can call out some of the tools or practices that are available for people to maintain those versions and distribute them among everybody working on the application.
[00:20:00] Pramod Sadalage:
Sure. Let me talk about the practices first, and then I'll give you a list of tools that are available. The practices are: any change you make has to be checked in and available for other people to look at and apply to their own copy of the database. That's the first bit of the version control. The second bit is, once a change is checked in and applied to a database, you cannot go back and modify that change, because it has been applied and you can't take it back. The third part is that all of these changes, when they are put in the right sequence, give you the end state as of right now. Let's say you started with state X and there is a sequence of changes you made; if you apply that sequence of changes in the order they happened, at the end you get state Y, for example. So you have to preserve that sequence.
Now, there are a bunch of tools that let you do all of that. They basically operate on the model of: I have a sequence of changes. Some tools use numbers, 1, 2, 3, 4, to keep track of that sequence. Some tools use timestamps, as in, this change happened at this particular time, and the next time something happened, you keep track based on time. Some tools let you name the changes and then decide the order of those changes. And some tools just track what has been applied and what has not. The other thing all of these tools do is keep the state of the changes that were applied to the database in the database itself. They generally create a change-log kind of table in the schema or database you are working with and keep track of which changes were applied and which were not. This helps during deployment. Let's say I have applied changes numbered 1 to 10, and there are two more changes that team members created, 11 and 12. The tool looks at the database and says: I have applied 1 to 10, and in version control I have 11 and 12, so I have to apply 11 and then 12. That's how they know what is applied and what is not.
A bunch of tools allow this kind of thing: Flyway, Liquibase, DBMaintain. Rails migrations are an example of this, and Python has some tools as well. So there are a bunch of tools that do this, but the more popular ones are Liquibase, Flyway, and DBMaintain.
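A toy version of the mechanism those tools share might look like the following sketch; it is not any particular tool's implementation, and the table and migration contents are illustrative.

```python
import sqlite3

MIGRATIONS = [  # ordered deltas; in real tools these live in version control
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
]

def migrate(conn):
    # the change log lives in the target schema itself, as described above
    conn.execute("CREATE TABLE IF NOT EXISTS schema_changelog (version INTEGER PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_changelog")}
    for version, sql in MIGRATIONS:
        if version not in applied:  # e.g. 1 to 10 applied, 11 and 12 pending
            conn.execute(sql)
            conn.execute("INSERT INTO schema_changelog (version) VALUES (?)", (version,))

conn = sqlite3.connect(":memory:")
migrate(conn)  # applies 1 and 2
migrate(conn)  # no-op: the change log says everything is already applied
```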
[00:22:36] Tobias Macey:
And over the past several years, the prevalence of data abstraction layers in the form of object-relational mappers and object-document mappers has made the creation and design of databases more accessible to developers and put a lot of power in their hands. So I'm wondering how you have seen that impact the overall practice of schema design and evolution.
[00:23:04] Pramod Sadalage:
Certainly, it has changed a lot. Especially the ORM mappings, the way Hibernate and certain tools let you put annotations and do these one-to-many designs, is much easier on the developers. At the same time, it's actually a very good thing for the data people to understand these tools. Many times in the industry, you find people who are good on the dev side or good on the data side, and nobody's interested in the space in the middle, from the ORM to the point where you actually process data. If you are good at that space in the middle, you can talk to the developers as well as to the DBAs, so that space, I think, is really important to understand. When I do a one-to-many mapping in my code, what does it mean on the database side? Am I doing certain things that are not efficient, for example? There are certain things that can be improved if you collaborate with the DBA-type people, like query performance. But at the same time, it's very important for the DBA-type people to understand what it means when I'm doing a one-to-many mapping in Hibernate.
So there's a knowledge gap between the two sides, the developers and the data people, that needs to be bridged. It has had an effect: the developer-type people are just like, I'm just going to persist this, and the data-side people are like, this doesn't follow proper design, and that kind of stuff. Some of that friction I still see around, but I think it can be alleviated by proper knowledge building.
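As an illustration of what a one-to-many mapping means on the database side, here is a small sketch using SQLAlchemy (one of the ORMs linked in the show notes) rather than Hibernate, with invented model names. The mapping materializes as a foreign key on the "many" table, which is exactly the kind of spot where developer and DBA concerns meet, for instance whether that column should be indexed for the join.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    orders = relationship("Order", back_populates="customer")  # the one-to-many

class Order(Base):
    __tablename__ = "order_header"
    id = Column(Integer, primary_key=True)
    # the mapping materializes as this FK column; indexing it is a DBA question
    customer_id = Column(Integer, ForeignKey("customer.id"), index=True)
    customer = relationship("Customer", back_populates="orders")

engine = create_engine("sqlite://", echo=True)  # echo=True prints the emitted DDL
Base.metadata.create_all(engine)
```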
[00:24:53] Tobias Macey:
A lot of the conversation we've had so far, and a lot of the original focus of the book, was on relational databases, and I'm curious how the overall practices and principles that we've been talking about are affected by the use of non-relational storage systems, whether that's a document store or a graph database or anything along those lines.
[00:25:16] Pramod Sadalage:
I think a lot of the principles still apply, but the techniques and technologies have changed a lot. The NoSQL field is so vast; even among document databases, the way Mongo behaves versus the way OrientDB or Couchbase behaves is different. So the tools and techniques need to change according to the technology, and there's no standard interface as such for people to build tools on. You see a bunch of tools built specifically for Mongo, a bunch built specifically for Cassandra, and similarly a bunch built specifically for Neo4j, and things like that. So it's much more scattered compared to the relational land.
But the principles still apply. For example, if you take a graph database like Neo4j and you change the relationship type, all the previous relationship types are now dangling relations, and you cannot traverse them, because you have changed your relationship type in code but you have not migrated the data over. The state is still in the database; only the way you are affected by it has changed. A very good example of this: let's say I have a column called name on a product table. So product.name is a column; name is the column name, product is the table name. And if I change name to something else, let's say I call it description, or category name, or something different, in a relational database that immediately affects everyone.
The moment you push that change into the database, it affects everyone. The bad part is that the change is immediate and affects everyone. The good part is that I know something's broken. If you relate it back to a compile-time problem: if something doesn't compile, I know immediately. But in a non-compiled system, you can still check it in, and if you don't have tests or stuff like that, you're not hitting that part of the code, and when you actually do hit that part of the code, the runtime interpreter will complain. The same kind of parallels can be drawn between relational and non-relational databases.
In a document database, for example, in the code I change the attribute name of a document that I'm persisting in MongoDB. From now on, instead of product.name, I have product.description. Fine, I'm writing everything; everything's good. But when I search for documents where the description matches, let's say, a laptop, I'm only going to get the new documents I persisted. All the old documents had the name attribute, and I'm looking for a description attribute, so it will not find those old documents. And you will run into this logical mismatch of, I was looking for certain documents and I'm not finding them, because the state that was in the database has remained the old one and not migrated to the new one. The advantage in a NoSQL database is that I can lazily migrate this data over a period of time. In relational databases, when I change a column name or something about the database structure, it's immediate for everyone.
But in NoSQL databases, I can make the change and then migrate the data, like product.name to product.description, over a period of time. It doesn't have to be instantaneous. The flip side, the thing developers need to remember, is that the name is now in two places: in product.name as well as product.description. So I have to fetch from both sides and give a unified view to the user. And when I'm writing, I need to make sure I write it in the new way: if I pull out an old document that had product.name, when I write it back, I need to make sure I write product.description. So lazily, over a period of time, you can migrate the data over, but you need to be cognizant of the fact that the data is in two separate attributes.
This adds a lot of complexity for the developers, but at the same time it gives you much more freedom to move a little faster than what was available in relational databases.
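A minimal sketch of that lazy-migration pattern, assuming pymongo and invented collection and field names: reads consult both the new and the legacy attribute, and every write moves a document onto the new attribute.

```python
from pymongo import MongoClient

products = MongoClient()["shop"]["products"]

def find_by_description(text):
    # Query both the new and the legacy attribute, since old documents
    # still carry "name" until they are lazily migrated.
    return products.find({"$or": [{"description": text}, {"name": text}]})

def save_product(doc):
    # On write, migrate the document: move any legacy "name" into "description".
    if "name" in doc:
        doc.setdefault("description", doc.pop("name"))
    products.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```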
[00:29:49] Tobias Macey:
Yeah. So it seems like the relational systems push a lot of the responsibility for maintaining the state and the consistency of the APIs to the database itself, with views and stored procedures and database triggers, whereas a lot of the non-relational systems push that complexity up into the application logic and make it the developer's responsibility more than the responsibility of the person maintaining the data system.
[00:30:10] Pramod Sadalage:
Exactly, right. And I think, if you look at it, most of the NoSQL databases are schema-on-read; they're not schema-on-write. Relational databases are schema-on-write; they're more optimized for schema on write, whereas non-relational databases are more for schema on read. So at read time, you can determine what you want and make a structure out of that.
[00:30:39] Tobias Macey:
Right. And for a little while, people were trying to refer to them as schemaless, but that's only true... actually, it's really never true. There's always a schema; it's just a matter of where it's enforced.
[00:30:44] Pramod Sadalage:
Exactly, right. So it's not schemaless; it's more a question of where the schema lies. In NoSQL databases, the schema lies in the application code: that's what says what the structure looks like. And because it is in code, you have much more freedom. And because you have much more freedom, there's a higher likelihood of missing something and then not having the ability to reach data that you have already persisted.
[00:31:10] Tobias Macey:
And this brings us back to what you were talking about with creating tests to ensure that the API of the data layer remains the same during the course of a refactoring. So I'm wondering if you could take some time to talk about the different methodologies available for creating and maintaining these tests, and some of the differences in how you would test depending on the workload the database was built for, whether it's OLAP or OLTP or a data warehouse, etcetera.
[00:31:42] Pramod Sadalage:
Sure. From a testing perspective, if you think about the ORM layer and below, these kinds of tests are generally integration tests, not unit tests. We have generally written tests that exercise the domain object that is interacting directly with the ORM framework, which in turn is directly interacting with the database. It's a full-stack call that happens. And what we are trying to see is: can I persist an object? Can I read an object back? Can I update an object? Can I delete an object? And things like that. If you have this kind of test setup for every object, then when someone goes in and changes an object and some of these tests break, one, it gives you feedback that some other object is dependent on this. Sometimes breaking tests are a good thing: I want to make a change, and the breaking tests tell me all the places where this is being used, so I can go fix them. The other side of that is, sometimes I make a change without thinking about the consequences.
And all these breaking tests will tell me, oh, I need to think about this a bit more deeply before I make the change, so I can revert my change and rework it. I consider this experiment space. As a developer, if I have my own schema to work with and I'm making my own changes, I'm running all the tests before I check in. If something breaks, it breaks within my schema; I'm not affecting the rest of the team. That's one other practice we push a lot. Generally, in big enterprises, you see the whole team working on one schema, like a development schema; all the changes they make are in that common schema, and they develop against that. If I make a change in that schema that is not good, or I'm renaming something, or the change is bad, or I made a mistake, it affects the whole team. And in that kind of space, if you are scared of making mistakes, you generally don't tend to experiment to your heart's content.
And when you don't experiment to your heart's content, my feeling is you don't come up with the optimal solution. You need to try out a couple of things two or three times and then come up with the right solution. If you don't have the freedom to try multiple times, you basically make the safest choice and just go with it, because it doesn't affect or break others. So generally, we push for the practice of every developer getting their own schema. They work within their own schema, they experiment within their own schema, and when they are comfortable with their change, that's when they check it in. That notion of working within your own space is what we have been talking a lot about: to make things comfortable, to make things safe, and at the same time come up with optimal solutions.
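A minimal sketch of one such full-stack persistence test, using SQLAlchemy with a throwaway in-memory SQLite database standing in for a per-developer sandbox schema; the Customer model and test body are illustrative, not from the book.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String)

def test_customer_round_trip():
    engine = create_engine("sqlite://")  # private, throwaway sandbox schema
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        session.add(Customer(name="Acme Corp"))   # can I persist an object?
        session.commit()
        found = session.query(Customer).filter_by(name="Acme Corp").one()
        assert found.name == "Acme Corp"          # can I read it back?
```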
[00:34:49] Tobias Macey:
And one of the main things that you mention in the early chapters of the book is the idea that developers should have access to sandbox environments for experimenting with the changes to the database that they're making. One of the insidious issues that almost everyone has run up against at some point with databases is that the data in your production system will invariably cause different behavior than what was in your QA system. So I'm wondering if you have come across any workflows or development patterns that help propagate the oddities of your production data down to lower-level environments, to ensure that the changes you're making are going to work when they're ultimately released to a production context.
[00:35:35] Pramod Sadalage:
Yeah, certainly. I think the newer Dockerized versions of things are helping us a lot. There are maybe two or three things to talk about here. One is, when developers are developing, you don't really want a big dataset for them to work with, because it slows them down. If they are thrashing their schema away and rebuilding it, and it's a really large schema, it takes a lot of time to rebuild, so you want to reduce that time. Generally, what we do is give a bare-bones schema and a bare-bones set of data for them to work with, like a sample customer and a sample set of reference data, so they can go faster. The next step on top of that is for QA to do proper testing, and even for business people to see proper results for features that are not in production yet; they don't have data for those, so you have to make up some good data for them. That's the next set of data we look at: a richer, more curated set of data that has some of these functional oddities, like a customer with a longer name, or a product that's been around since 1907, age-wise, and that kind of stuff, so QA people can test the limits of the system. Then the next test you want is performance testing, and there are two or three ways to look at this. For performance testing, I can take production data and put it in. Of course, if you are in a HIPAA or PCI or some kind of restricted environment, you can't necessarily pull production data, so you have to anonymize the data, and there are tools available to do that. The other side of the shop is that you can take your QA dataset and extrapolate that data: if I have 10 customers, I'm going to multiply them to a million customers or something like that and build the dataset.
And then you can run migrations on top of that, run performance testing on top of that, and things like that. We have had a lot of success extrapolating the dataset. There are also tools available to do data generation, like a Thoughtworks open source tool called Bobcat that generates sample data given a specification, a DSL-type specification. You give it that specification and run the tool, and it will generate a bunch of data you can put into the database. So you need to look at this on multiple axes: ease for the developer during development time, ease for QA testing, ease of functionality testing by business people, ease of performance testing, and also making sure you understand how much time a migration is going to take in production, for example. There are also a lot of people who run migrations on replicated environments to make sure the migration will happen in the requisite amount of time. You can take a replicated node offline, apply the migrations to it, and see how long they're going to take. Do I need to optimize the migration?
Can I bear that amount of churn? If you're okay with that, then that's fine. If not, can I optimize the migrations themselves? Things like that matter a lot. A very good example: let's say I have a customer table that has 10 million rows in it. In the developer's dataset, maybe there are 5 rows. They just add a column and write an update, and it runs pretty fast. You take it to production, you add a column on a 10-million-row table, and that will run by itself for 30 minutes. You can't do that, and the update itself will take even longer. Can you live with that? If you can, fine. If you cannot, then you have to think about alternate ways to do it. Getting feedback that this migration is going to run for 3 hours versus this migration is going to run in 2 seconds is important. And generally, we have used the performance environment to benchmark those kinds of things.
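A rough sketch of that extrapolate-and-time idea, again using in-memory SQLite with invented table and column names: clone a handful of QA rows up to a production-like volume, then time the migration against it before it ever reaches production.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
qa_rows = [("Customer %d" % i,) for i in range(10)]          # the small QA set
conn.executemany("INSERT INTO customer (name) VALUES (?)", qa_rows)

# Extrapolate: keep doubling the existing rows until we reach production-like volume.
while conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0] < 1_000_000:
    conn.execute("INSERT INTO customer (name) SELECT name FROM customer")

start = time.perf_counter()
conn.execute("ALTER TABLE customer ADD COLUMN segment TEXT")  # the migration DDL
conn.execute("UPDATE customer SET segment = 'retail'")        # the migration DML
print("migration took %.1fs" % (time.perf_counter() - start))
```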
[00:39:30] Tobias Macey:
And you mentioned a bit earlier the impact that the DevOps movement has had on the overall practice of managing and evolving databases. But I'm wondering if there is anything more you have to say on how the increased focus on automation and continuous integration and deployment has affected the current state of the art in how databases are versioned, evolved, and maintained.
[00:39:59] Pramod Sadalage:
Yeah, I think it has had a profound effect, because before this whole movement came about, developers developed and handed off the schema changes to the DBAs, and they in turn deployed. And the deployment was very choreographed, human-choreographed: the developers or the sysadmins deployed the JAR, the DBAs deployed the schema changes, both systems were then turned on, and everything was verified. If stuff was broken, humans fixed it. Now, when a developer makes a change, they basically check in, and the change goes to production, in some places within minutes, in some places within days, in some places within weeks. There's no human involvement.
And when that happens, there is no control point for the DBA to look at what is going in and what is not, and whether it is right or not. So for them to come back into the development cycle, they have to move up the chain and provide a lot of automation for all kinds of things. One is, like I was talking about, schema provisioning for the developers themselves. Another may be schema provisioning for QA people, because there are so many versions of the application in flight at any given time. For example, build 93 may be in production, build 102 may be in UAT, build 105 may be in QA, and build 108 may be being built right now, while developers are working on the latest feature. So right there you have several different builds going on. If a bug was reported against build 93 in production, some developer may have checked out the code as of build 93 and be working with the schema that was in place during build 93. There is so much work that gets created for the data people that automation is the answer, and they have to automate if they want to stay on top of all of this. Docker, for example, is a good tool that we have put to use. If you want a QA database, you pull down the Docker image, apply the latest migrations, and the database is ready for you with that QA dataset. And when we put some build into production, we create a Dockerized container and put it somewhere, so if people want to work with the database as of build 93 that's in production, they can just pull that Docker image and work with it. That's one good example of applying Docker to this automation requirement. The other part is using tools like Flyway or DBMaintain or Liquibase to choreograph the migrations, so that once you check in, you don't have to worry, because the build scripts apply the right migrations in the right environment as you go forward.
Then you can also create a lot of automation within build scripts. For example, when some new developer joins a project, they basically do a git clone and have the code on their laptop or desktop. They can literally fire off a couple of build commands to get their own database environment set up. We have worked with Gradle or Ant or Maven, where you basically create a target for yourself, say, a Gradle "init my database" target, for example. It'll pull a Docker image for you, create a schema for you, put in the developer schema as well as developer-related data, and you have a schema ready. This automation really frees up time for the data people and at the same time gives every team member a consistent view of what they are working on: the environment stack, the database stack, the versions of the database software you are using. All of that can be consistent across the whole team.
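A sketch of that "one command gives a developer their own database" idea; the image, port, credentials, and the Flyway invocation here are assumptions for illustration, not from the episode, and in practice the same effect is usually wired into a Gradle, Maven, or Ant target.

```python
import subprocess

def init_my_database():
    # Start a throwaway Postgres container for this developer's sandbox.
    subprocess.run([
        "docker", "run", "-d", "--name", "dev-db",
        "-e", "POSTGRES_PASSWORD=dev", "-p", "5432:5432", "postgres:15",
    ], check=True)
    # Apply every versioned migration, exactly as CI and production would.
    subprocess.run([
        "flyway", "-url=jdbc:postgresql://localhost:5432/postgres",
        "-user=postgres", "-password=dev", "migrate",
    ], check=True)

if __name__ == "__main__":
    init_my_database()
```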
[00:43:57] Tobias Macey:
And looking back over the past 12 years since the book was first published, I'm wondering if you can call out some of the biggest changes that you've seen in the areas of database design and evolution and in the tooling that's available for creating those designs and applying those versions.
[00:44:17] Pramod Sadalage:
I think one of the changes is a desire to use a polyglot approach. Previously, everything was in one database. Now people are much more open to using different databases within the same system, maybe a key-value store for caching and a document store for storing some content, and things like that. That's one definite change I'm seeing: the ability, or the desire, to use multiple databases. The other is that CI is much more prevalent nowadays, everywhere, so people are using techniques that enable CI on the database side as well. There's also a lot of change in how object-relational mapping tools are being used, like Hibernate and TopLink. When we wrote the book, Hibernate was very new and TopLink was also very new, but now they have matured much more, and they're a default way of accessing the database layer. Ruby on Rails especially has made accessing the database through the ORM, with ActiveRecord, a much easier approach. So the usage of ORM layers has certainly increased a lot compared to those days.
The next thing that's also big is the whole approach of distributed databases and distributed applications. The need to have replicas, for high-availability applications, is certainly on the rise, and that's making people think a lot more differently about their design and things like that.
[00:46:00] Tobias Macey:
And looking forward to the next few years, I'm not going to ask you to predict the next 12 years, because the rate of change is just too vast, but what do you see as the biggest challenges that we're going to be facing in the near term and midterm in managing database systems and the data layers of our application stacks?
[00:46:21] Pramod Sadalage:
Certainly, I think evolution is a big thing. As you go towards bigger polyglot usage of data, if you make a change in one layer, how does it affect the other layers, and things like that? That's one big thing. The other is all this streaming data, Kafka and HDFS, all the data that we are putting in; this needs to be evolved as we go. And how are we going to ensure that proper data governance and proper data cataloging are available, so that people can make use of what's already there? I think that's going to be a big challenge as we go further down this line.
The other big notion that I see, as I talked about before, is the DevOps movement. As the DevOps movement gets more popular and more bought into everywhere, it's going to put a lot more pressure on the data side to make sure that they fit into this land of DevOps-enabled teams.
[00:47:21] Tobias Macey:
And are there any other topics of discussion that we didn't cover yet that you think we should talk about before we start to close out the show?
Pramod Sadalage:
I think we have covered most of the stuff.
Tobias Macey:
And so, for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes. As a final question, from your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:47:50] Pramod Sadalage:
I think most of the database vendors, other than the NoSQL ones, don't have tools for automation. Most of the tools you look at from the big relational database vendors are all UI-based, geared towards an administrator working on their laptop to do certain things. Most of these tools do not work on a command line and do not integrate well into a CI/CD pipeline. That, I think, is a good place for database vendors to look at what kind of tooling they can provide, so that it improves the productivity of the team as such.
[00:48:30] Tobias Macey:
Alright. Well, thank you very much for taking the time out of your day to join me and discuss your work, both on the Refactoring Databases book and all of the experience that you've gained in the time since it was written. I appreciate that, and I definitely recommend that anybody who's working in this space take a look at your book, because it's very informative. I hope you enjoy the rest of your day.
Pramod Sadalage:
Thank you very much. You have a good day too.
Introduction and Guest Introduction
Coauthoring Refactoring Databases
Challenges of Database Refactoring
Collaboration Between Developers and DBAs
Versioning and Tools for Database Changes
Impact of Non-Relational Databases
Testing and Maintaining Database APIs
Sandbox Environments and Production Data
Impact of DevOps on Database Management
Changes in Database Design and Tooling
Future Challenges in Database Management