Designing Data Transfer Systems That Scale

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles.

RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable enriched data to every downstream team.

You specify the customer traits, then profiles runs the joints and computations for you to create complete customer profiles.

Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.

You shouldn't have to throw away the database to build with fast changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialise, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products.

Whether it's real time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialise gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface.

Go to data engineering podcast.com/materialize

today to get 2 weeks free.

Your host is Tobias Macy, and today I'm interviewing Andrei Tserakav

about in yeah. About operationalizing high bandwidth and low latency change data capture. So, Andre, can you start by introducing yourself?

Hello. My name is Andre. I'm working as a

as a engineer for

quite a lot. I think it's more than 10 years now. And for last, I would say, like, maybe 6, maybe 7 years, I work for this

mainly on distributed systems and,

mostly focusing of data movement stuff,

moving data from point a to point b somehow, somewhere, some with some data, and I'm actually specializing in it. I have quite a big journey

before, but right now, I'm mostly focusing on this stuff. And,

I think I have quite a good experience in this area

from my past.

And do you remember how you first got started working in data engineering?

Actually,

it was quite a fun story because,

initially, I have my own startup, which is boring 1. We'd be building the bus transportation

application, bus, in terms of, like, cross city bus,

international buses, something like this. This laptop was fine. Everything was quite okay. And at some point, a big corporation buy out our startup. And,

after

a year inside of this big corporation, they decided to close a startup, and they proposed me to change my role. And, actually, after so many years working in this

fascinating from terms of of a business area, but not so interesting in terms of a technology to change my role, and I decided to change to more technical stuff. I always,

love working with the databases and, decide to go to the infrastructure team. Mostly, it was data infrastructure team. And, I start working on the project that

actually quite old,

I would say 15 years before. Once I joined, it was has already 15 years of history, something approximately like this.

And,

it moved a lot of data from,

a lot of different,

sources into 1 big data warehouse. This data warehouse was extremely big, and my goal was to

make,

delivery lock a bit less than it was before. And,

this was what initial

touch of anything in data engineering field. I think it was 2016

or 2017,

something about this. And, at the moment, they,

as a product that,

moves this data, utilize

a lot of

pretty old but still quite

good concept as a MapReduce.

So there there was, like, 2 main main pieces of this

software. 1, it's, like, delivering data from,

actual

producer of of the data. It's usually application

to a sort of a queue. It was not a Kafka because Kafka was not able to handle such load. It was internal replacement of a Kafka. It has some Kafka

flavor. For example, topics still named as a Visa Kafka prefix, but it's implemented in the c plus plus to make it little bit faster. Right now, I don't remember what exact number, of traffic was was in 2016. But in 2018, it was about 5 gigabyte per second. It's a lot of traffic that in incoming in this in into the system. And these traffics goes from the applications. Usually, it's worth written somewhere and deployed somewhere in the cluster, and they provide the logs. And these logs need to be delivered into

the data warehouse in a structured way as soon as as fast as possible. But initial design was like, we have this, application.

It provide data into the queue. From the queue, we provide this in a rough format in a a data warehouse. And then we have a huge pipeline of map reduced jobs that parse this data, index this data, and publish this data as a final tables. And

this whole pipeline was,

quite a big scale. At the end, it generate, like, 5 or 7000

different kinds of tables, and,

amount of data was extremely high. It's it's a dozen of petabytes of data. So it's quite extremely loud. The problem with all this approach, it was very good in terms of scalability. It scaled quite well on horizontally

because it utilized map reduce approach. It just chunking chunked of raw data by some side, then produce

transform it somehow, parse it somehow, index it somehow, push it in a table, and then, get another chunk, etcetera, etcetera. But deliver the lag between the actual event on the machine, on the application, and the,

final table, delivery lag was quite high, about 5 or 6. It was constant 1, so it's not growing

anytime. It's always 5 or 6 hours. It's actually not doesn't matter if your table big or small. It's always 5 or 6 hours. It's like constant,

tax for your for your data. I think this approach is more or less

well known because it's like batch processing, big batch processing. And usually, if you have a big batch processing, you have a delivery log in in this amount. If you have a daily table, so usually to provide a daily table, you need to end the day and add 5 to 6 hours for composing

this table at the end of the day. Something like this. And,

I was, like, quite shocked about all this situation because before before this,

I all only works with a a moderate amount of data. I use, like, classical databases, MS SQL, Oracle, Postgres. And they usually

if you have, 100 gigabytes of data in it, you have already big data. But once I saw the real big data, it was, for me, a bit shocking, especially the technologies they use, the approaches they use because it's quite different. And they'll they don't use anything like actual applications. It's just a bunch of Python scripts here and there that run everything. And no 1 actually understands how it works because it's 15 years of history. A lot of people change. A lot of, people go with the name. It somehow works. You can kinda understand how it works, but, overall, it's quite complex and frail system. Yeah. And it was my first touch with a big data world, kinda like this. And given that trial by fire, figuring out how to manage

this data transfer, the challenges associated with it, You decided that you actually wanted to invest in it even more and continue working in that space. And I'm wondering, given that context and the work that you're doing

now, what are the core foundational

elements of the problem of

data transfer for purposes of

as self solvable, I would say like this. Because in as self solvable,

I would say like this. Because initially let me give you a big bigger context again a little bit. This whole sito the whole system was,

front facing to actual user by a configuration in the central repository. So we have a repository. In this repository, you have a file. You have actually not 1 file, but 5 file. Files is a file 5 JSON files with all configuration for the whole companies. These files, like, 5 megabits of data. It's a big file big JSON file, which is not automatically produced. It's actually produced by users. So if user want to add 1 new item to this pipeline, you need to commit in this file. Then the CI job will trigger some Python script. This Python script will trigger another Python script, and the whole machinery starts working by magic. A user just communicate

with all the system by changing in this pie,

JSON file. And this is self sort of sole service approach is quite good for the data engineering because at the end, if you're providing this as a service, if you're as engineer who designed and developed the application, for you, it's very complicated systems that do a lot of stuff. It's a lot of code. But for end user, for actual data engineer who just want to see the data in the table, it's not important. They just need to somehow set up the delivery from point a to point b. And the idea that, come to my mind is actually what's actually approaching me even right now is to give the tool that can be can be used by people who don't really care about the details about implementation.

Just 1 important thing, I need to move data from point a to point b. Just do this somehow. I don't care how. I don't care,

I I don't care the details. The only thing I care, it's about freshness of the data and consistency of the data. Just you don't see me that consistency would be there, and

freshness would be not so bad. And,

idea of making this self storage approach is actually very

core for me

in in our product actually even right now, the ability to give this as a sole source. The only problem that,

once you start doing this, you you start realizing that moving data from point a to point b is quite a complex task. And

giving this as a tool rather than, like, a framework for development is a bit more makes everything complicated

and, make,

the configuration of all this pipeline

extremely complex. And this actually

very challenging task to give the user ability to, run this. It's it's a first step. Then investigate and monitor this because once you run it, you need to somehow monitor it. And once it broke, you need to, add some tools for investigation and triaging and debugging. And all these pillars actually is a foundation for good service, for providing,

service for moving data, not just a JSON file with all configurations.

And the the shift from this JSON file of with the configuration to some sort of a service that you can configure

on your own and use it as a as a tool in your daily work, it's, like, quite challenging. And in your work of building these tools, trying to make a self serve capability for being able to manage those data transfers,

what are some of the elements of the existing solutions in the ecosystem that you find to be lacking and that necessitated your investment in this problem and continuing to build out your own solutions?

Well, we actually started

building this solution, I think,

2017, 2000 18. First thing that we missed is actually good library of connectors.

The things that you need to connect multiple sources, multiple

different kind of sources, actually,

right now, there is a whole bunch of open source technology

on on on the market. They

kinda can provide you some capabilities.

There is, like, Kafka Connect for connecting to the relational databases. There is Arbyte for connecting non relational databases.

There is, like, Stitch, Meltano,

anything everything, like, in this,

world that connect allow you to grab data from somewhere is, like, connected to to your to your system. But the problem that quality of these connectors usually is unpredictable,

and,

performance is maybe an issue. For example,

we have a big scale of all this stuff. We have a lot of data.

And using the

default connector, for example, for

Visa, Kafka Connect was quite expensive for us. We have to scale the system not just horizontally

and vertically, but actually be able to we should be able to run as many as as many as possible small connectors. And usually, the

for example, the Kafka Connect 1, it's most popular implementation of changes that are captured for the open source world.

Run of single connector usually require you a pretty decent virtual machine in, 4 gigabytes of RAM. And, if you have, 2, 000 databases

that run these connectors, it's pretty a lot. Some of them is actually reasonable to have this 4 gigabytes of RAM inside of the cluster to allocate to this database because it's decent load. But most of them actually produce 1 or 2 rows in the in the middle. You still need to somehow

do the transfer, And you have to think not about just scalability in terms of, like, how many instances you can run, but how cheap the cheapest instance

is. 1 of the key reasons why we initially decide decide to go to the own implementation is to reduce the overhead that usually comes with the open source solution and make it as simple and and clean as possible,

as and focusing on the,

target,

connectors to make it not like a big library of connectors, but, like, small, but really good with good scalability and good

performance to consumption ratio because, usually, once you start something, you consume some resources. It's usually CPU and memory. Sometimes it's something else. It may also be, for example, network. But usually, you're bottlenecked in the such scenarios is CPU and memory, and you need

might minimize them. And,

having the generic solution like Kafka Connect,

is not efficient because it has a lot of, extra overhead. And, initially, right now, it's not the case, but, like, 5 years ago. For example, Kafka Connect can't live without Kafka. And it was a it was a big deal because if you want to connect your database

and read this data and put it to a data warehouse, you have to put a Kafka in between no matter what. And it's actually a bit a bit of overhead

in some cases. And, we try to eliminate this extra overhead and connect them directly because why not? And this is,

another point why we should why we decide to implement it on our own. In that space of data transfer, there are also a number of different

patterns and

requirements and constraints based on the type of the source system. So change data capture, as you mentioned, is 1 approach to reducing the overall latency of data transfer

API endpoints where you need to be able to query the the rest API to be able to get new data. Not all of those even support having some sort of incremental sync capabilities. I'm curious how that complicates the overall problem space of saying, okay. Well, for these data sources, I can do this, and I can make sure that I've got subsecond,

data transfer. But now when I'm dealing with my Hubspots or my Salesforces of the world, I have to do a different approach, and maybe some of these endpoints support incremental other other endpoints. I have to do a full refresh. I'm just wondering kind of how that compounds the overall problem space. It's make everything more complicated, Justin.

That that's actually the answer.

You can, like, try to minimize it somehow, kind of try to manage it, but overall, it's still goddamn complex. In general, we have, like, a huge matrix of of compatibility

of different connector, like this connector compatible with these features, this connector compatible with these features. And then we can when we compile them, we, say that this transfer, like source and target endpoint pair of the source and target endpoint compatible with those features because they have, say, compatibility metrics intact. In fact in fact, we actually

the idea that we've concentrated on a selected set of, connectors give us an ability to tune a little bit our approach. And initially, we just have 1UP. It was the first iteration of a service. You have CDC and nothing else. Then we start adding the features 1 by 1, compatibility

features like 1 by 1. 1st, we add the backfill functionality,

which we call a snapshot. And this backfill functionality,

then we combine with

replication in in CDC. So you backfill and then replicate. Then we start realizing that not everyone contains the exchange streams, and we're adding the incremental snapshot. And right now, we have, like, all set of, options available. But, initially, we start with a low a low point. Like, the first point was a change that I capture. I think in

the most, tools, the change that I capture is a most com com complicated,

approach because it require

more detailed implementation of 1 particular connector.

But we start with the hardest 1 and just adding the, anything else around it. And we build everything around this concept of change data capture. And what we try to do is utilize the same approach to any other way of reading data. So we imagine any database as a stream of changes. And this actually

simplifies

the design, overall design of the system and give us

ability to build

this extra feature easier. So stream of changes, it's a simple concept. For CDC, it's more or less natural. For snapshot, it's not so natural. But what we say is like, okay, what is the difference between a snapshot

and a change that I capture? The only difference, like, there is a stream of changes, but at some point, it ends. That means that your snapshot is ended. You read all your data. And if you have a CDC, this is infinite

process. The same for incremental things, actually. This incremental thing is just infinite loop of changing

stuff. It's still a stream of changes, but these changes have more granular,

burst nature.

At some point, they start reading new data, burst your new piece of data in the stream, and then, like, sleep for for a while. So having this in the core of a system, this concept of a stream of events, is actually helpful for us. Another thing that we actually

designed, it was

quite okay concept that we chose at the beginning, and it right now would be helpful for us, is treating everything as events and, treating

any change as a logical event, and logical event bounded to a string,

bounded to a row, bounded to a row, 1 particular row. And this row can be bounded to a table. Initially, we think it's, like, more than enough for us. But once we start adding more features, we add a concept called technical events. And these technical events give us the ability to add extra

features to the pipeline. So overall, we have just a pipeline or a stream of events, infinite stream of events. It can be finite if you have, like, a snapshot. But for the code itself, it's not a big deal. It's not there is no difference. But in this,

infinite stream, there is, like, a checkpoint, for example, or maybe some events for some technical details, give a copy of data, for example, or run the DBT transformer

or something like this. These technical events give us ability to inject functionality on the stream and give it, like, a flexibility to add the functionality.

So the core concept, like, to summarize all of this, is treat everything as a string of of events and treat events as an event. So they can be either a row or something not related to row. Some events for

non data related activity. Like transaction starts, transaction ends its events, but there is no data. Data is somewhere in between. But these transactional events allow us to give more guarantees for actual users, like for moving data on a transactional

side. So we have trans

the transaction boundaries intact,

something like this.

This episode is brought to you by DataFold,

a testing automation platform for data engineers that finds data quality issues for every part of your data workflow from migration to deployment.

DataFold has recently launched a 3 in 1 product experience to support accelerated data migrations.

With DataFold, you can seamlessly plan, translate, and validate data across systems, massively accelerating your migration project.

DataFold leverages cross database diffing to compare tables across environments and seconds, column level lineage for smarter migration planning, and a SQL translator to make moving your SQL scripts easier.

Learn more about DataFold by visiting data engineering podcast.com/datafold

today.

Given that core

assumption and the core logical foundation

of these

streams of changes

for the situation where you do have a data source that will only give you a block of data. And every time you ask for it, it's maybe the same block of data with a few new things in it. I'm curious how you think about trying to map that into the set of changes. Do you then have to maintain history of, this is the block of data they gave us last time, and now we're just going to remove any duplicates and only add the new things to the change set? Or just curious how you think about mapping that into that problem space.

Yeah. Act actually, it's both way. The first thing first, you need to treat every block data to block data, but sometimes you wanna,

to short

cut extra cost because reading this block data is not cheap. It's not for free. And,

you need to store some server's

course of stuff so you know the checkpoints. And these checkpoints,

can be used as

a part of a query or API for your source

to not, like, just grabbing this data in the first place. And, this actually

concept that we utilize for incremental snapshot lag. So we store in the cursor and then use this cursor to minimize the amount of data for the next iteration. So we don't just read everything. We just read smaller. Power applications is even easier. Once we

receive an answer from a target database that everything is written, we can move the cursor of our change stream a little bit further and, continue reading. The only tricky part here is on restart.

And this is, like, as a quickie question appears. It's

a question about delivery semantics. It's usually an important 1. And initially, we aimed to the at least 1 delivery semantics. So every every time if something goes to the failure,

we restart our worker, and this actually

give us some duplicates

of data because we send something to a target database but not receive

acknowledgement. And this acknowledgement

is in a gray area. It can be either not delivered because it's it's okay. It's not delivered. But sometimes it's written, but we don't know about this yet. And, they

we we read stuff from the source database, give us a duplicate of this amount of data. What actually can be done with this?

Actually, nothing can be done particularly.

The only thing that can help is utilizing transaction capabilities for target database to store some sort of marks on this target database in in in a transactional way and remove these items from a write

batch in on the target side. So you still read this data a little bit more, like excessive data, but you should not write this in a target database to prevent duplication. In most cases, it's not a big deal because you have a primary case. Primary case duplication works best for almost any cases if you have a primary case, but not every database has a primary case. For example, s 3, which we actually treat s 3 as a database, is not contains any primary case duplication. So we need to invent some sort of a logic around it. But it gets it can be implemented

if you have the right concept in in core of this logical event. And if your logical event contain,

some sort of

sequence inside of it, you can easily deduplicate data based on this sequence. Most of database, most of change streams contain sort of a sequence number, and you can just utilize the sequence number to deduplicate stuff on target database. And we actually utilize it a lot. We provide for all changes the sequence number in it and just deduplicate the sequence number based on the sequence number in our target side. And now moving to the

consumption side of a transfer service where you say, I just wanna make sure that my data gets from over here to over there. I don't care about all of these details. I just wanna be able to click a button or say, this is what I need, and everything just happens. And I'm curious as you have been working through this problem, building a product around enabling that interaction case, what are some of the sharp edges that you've had to deal with and manage,

and and develop or an engineer around

to be able to give that end user experience of not having to care about all of these vagaries of the underlying technical implementations

and

the failure cases and recovery and retries?

I would say that, what I learned from all this stuff

is happy path is easy. The simple stuff works pretty straightforward,

but there is, like, edge cases, and edge cases are really hard. The first thing for problematic things that we actually, faced on,

on on our journey to implement this self-service,

kind of magically scaled solution is

sometimes

user don't want to scale. Sometimes

the scalability should be

more

controlled,

etcetera.

What I mean by this is the transfer itself capable to read any amount of data from a source database

just give us enough throughput.

And the problem, if users don't think about this, like, throughput problematic,

we can use all throughput from database. And there is no more left for any other users. In some cases, it's okay if we are only user of this data. But, I think it's rarely a case. So first thing that we I learned that actually this magic

should be controllable.

A user can actually control our scalability points.

We provide some,

insights, some settings that user can actually adjust,

based on their needs.

Rather small 1, so they don't want to

run more than 1 1 or 2 concurrent queries on top of it. So they can specify,

I wanna limit your capabilities to 2 concurrent connections.

Deal with it. And then we should do as much as possible with this limitation.

So users should provide us some limitation

and we provide them,

this ability to give this limitation. The only thing that harden this scenario is it, like,

explain to a user how they Because,

as as I already mentioned, most of the team just don't want to think about the

complicated stuff like, okay. My Postgres is overloaded.

What should I do with it? So we need, like, a lot of examples,

best practice, and most importantly, reasonable defaults. So I would say in 80%, maybe even 90% of the case, users don't change our default,

and this is more than enough. But having these defaults is a very hard 1. You have a lot of you should have a lot of exploitation expertise

of the

overall system to understand the best practice

and put it in the right place. So we have a quite smart default

that actually adjust it based on your target and source databases,

based on your

type of a transfer, amount of a tables, etcetera. And these defaults,

actually

hard hard to pick, hard to adjust.

But user can be

can adjust them if if they want. But in most cases, they shouldn't do this.

And

as you have

moved beyond just the initial

capabilities of, I have the system, it will work. I can do a, b, and c with it to now I need to actually run this at scale. I need to make sure that I can automatically provision capacity as new people come in or as their use cases grow. What are some of the challenges of operationalizing

this complexity while still providing a an accessible and enjoyable end user experience?

I would say that main main challenge here is,

providing,

providing this is easy, but you should keep track of your costs because,

once you provide something as a service, like you're hiding some costs, you're hiding CPU, you're hiding compute under the hood. So users don't think about this, but you should think about this. So you have a should have a lot of monitorings in place. You should have a duty, shift duty for your engineers. So your engineers

always on the call because your system became a critical 1. If something is broken in your system, a lot of people can be

affected. So you should

always

take a look on this. Someone should be always online. Someone should be always

on call. So on call is important. Also important is

monitoring and alerts. So everything should be automated. Some weird behaviors should be automated. Right now, we don't have, like, enough,

in my opinion, alerts. A lot of stuff just,

flooring just by observing the monitoring.

You see a strange pattern on the monitoring, and you then go to investigation, then maybe find some bug, maybe some interesting use case, etcetera. And another thing that is important to give a scalable experience is providing,

building service with,

runtime agnostic in mind.

We build our system

initially as a runtime agnostic 1. So anything can that can run the Docker container is our runtime,

and we can host on top of anything.

And the only thing that we need to care is that we have enough capacity of this underlying system. Right now, hopefully, we have AWS, GCP, Azure that, take a head of thinking about actual physical machines out of our heads. So at least this point is solved, but you still need to manage,

alerts for costs, for amount of virtual machine, etcetera.

This is still quite important

because once I joined the company back in the days,

we host all this complex stuff, complex solution on the physical machines, in the physical data centers. We have, like, several big data centers with actual machines in it, And, sometimes failures be was not because we,

don't have, like, good coders, maybe

not enough optimization, but sometimes just too many traffics come into the system, and we can't scale it because there is no more physical machines available in this particular data center.

Hopefully, right now with the clouds,

it's not so painful

to,

scale up and scale down because you don't need to buy and wait until the physical machines come to your

warehouse where your data center is located

because

sometimes it can be quite problematic to scale a solution if you run this,

not run time agnostic, not in the clouds.

Data lakes are notoriously

complex.

For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics.

Trusted by teams of all sizes, including Comcast and DoorDash,

Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises.

And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hoody,

so you always maintain ownership of your

data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst

and get $500

in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Another aspect of complexity as you scale is

not just the

throughput capabilities of I can handle x number of gigabytes per second or I can handle x number of users, but there's also the scaling of

additional capabilities, additional complexity

as you move beyond the initial focus of I need to be able to manage change data capture. I also need to be able to start adding functionality around maybe API sources.

How how does the combination

of operationalized

system that your customers depend on with the need to be able to continue feature development

and overall improvement,

make the development process

harder? What are the aspects that cause you to have to move slower than you did at the beginning? Just wondering how you think about the evolution of the system and being able to design and architect it in a way that doesn't cause you to

in a way that doesn't cause you to stagnate and have to stop development to be able to make improvements?

Yeah. That's a hard 1. Tough question, actually,

because building is a scalable solution that can run-in the cloud, that can scale on thousands customers.

It's not that hard if you build it, like, hard tailed,

focused on couple connectors.

But we actually focused with the growth of connector base,

and,

the the main thing that,

actually give us this ability is 2 concepts. The first concept is pluggability.

We define now a system in architecture,

certain extensibility

points that actually can be used as extension

and build, as much as possible,

inside of a core.

Like, this core is working with, just an abstract concept. They don't know anything about 1 particular database, 1 particular feature, etcetera. This is

stuff that is common for everyone.

And then we have plugins that implement certain of aspects of connectors,

and these, plugins in total

compose a connector that connect to the database, to API source, to a target, to the storage, etcetera.

And this pluggability is

extremely important if you try to build,

a scalable in terms of a development solution because

scale of features is important.

And another aspect that's actually important, build being just a pluggable system is not enough. You have to have a lot of

infrastructure for unit testing and component tests,

and we have extensive support of,

tests, and we test everything. There's a lot of excessive stuff.

And,

initially, we,

designed, like, a sort of a test framework for our connectors, and it give us ability to to run a test and build a new feature without trying our solution. So,

enabling sort of extreme programming practice, which is like test driven development. We,

in some cases, like, I would say in most cases,

it may be easier to write a test rather than

doing something different like debugging it locally or maybe on some development stance. Sometimes it's easier to just write a test, make it green,

initially make it red, then make it green, and then deploy the feature. And you have

such flexibility

and easiness, so in which we invest invest a lot of, test infrastructure

to improve developer experience because developer experience, if you talk about

scaling in,

development efforts,

it's extremely important.

And we invest a lot in the in development experience, to be honest.

So 2 main pillars of this, a pluggable architecture,

make this everything as as as pluggable as possible. And second point is a development development experience and tests test infrastructure.

I think this is our

approach. I would say our our current team is quite small for the amount of code that we have, and the only reason why we're still able to support this

huge code base is, because a lot of x a lot of tests.

Just tests give us this, assurance that everything is more or less fine.

We have quite big code base, to be honest.

And when I was looking at the data transfer product that you're building, I noticed that in addition to the change data capture functionality, you're also

adding support for being able to use some of the air byte connectors capabilities

for some capabilities for some of these source and destination connectors.

I'm wondering

what your overall thinking is about the current state of that ecosystem,

what it is about the set of airbag connectors that makes it worth integrating with that, what your overall thinking is about the potential for

future evolution and development in that space, and maybe some of the considerations or concerns

that you have or that your customers should be thinking about as they decide which of those connectors to actually bring on board?

I would say right now, the state of the data engineering in terms of open source is a lot better than, like, 5 years ago. Even a couple of years ago, it's it's rolling quickly. And

I think 1 of the most important project in all of this,

bringing in the

engineering

in a good quality is I think Ira Byte is a bigger biggest win in terms of open source community.

The quality of Ira Byte evolving

in the last year a lot it's become a lot more stable, a lot less failures,

random failures of connectors, but yet it still, require a lot of engineering efforts to run this perfectly.

Through my experience,

running Arabyte connectors require you at least a couple days of investigating how it works

and,

reading the documentation, and then you can finally run it.

This is, I think, is, unavoidable because some connectors is just hard to stop. Like, you need to run a tokens. You need to prepare right scopes, etcetera.

But, overall,

it's working, and it's good.

The only 1

visible

flow from, RBITE perspective

is extensive

cost that come with RBITE.

The design of Arbyte built in, this,

extreme applicability

idea.

What we, we also do as a a inside of this, as a pluggable design, but we do some shortcuts for us to gain a little bit more performance.

So IRBY itself internally wrap everything in a Docker container and communicate between the source and the target via pipe,

or file.

So at least you have at least 1 serialization and deserialization in between. And they use this pipe structure to communicate and decouple actually source and the target, which is good because source can be written in 1 language and wrapped in Docker container and target another language and wrapped in Docker container, which is extremely good for

developer's call development scalability. So you can write a lot of connectors easily. You can write connectors in JavaScript, in Python, in Go, in Rust, in whatever you want.

But this actually brings down as a point. Another pain point is that

as a side effect of such flexibility,

you have,

performance,

slowdown. You have this bottleneck, which is a file usually between source and the target, and this serialization, this serialization, especially if you have a Python connector, and most of connectors written in their bytes is Python based

connectors. You have this bottleneck of serializing the data on the disk and then reading this data and deserializing back.

We internally read everything in Go and Go extremely good for running performance application,

and we don't serialize data in between. We

use only in memory structures, and this give us huge performance boost, for for our native connectors. But writing this connector is harder because

it's not just

a random,

Python code that you bring in your application and wrap in Docker container.

This,

limit our

capabilities. I think in modern world, it's like a trade off right now,

between

scalability

of, development and

scalability of,

application because Arbyte gives us extra cost, and you can't unavoid it. Another thing I would say is that right now, there is still,

plenty of

open questions in terms of implementation of

change that they capture. Right now, the implementation of change that capture in Erbate is not so good.

It still have some flaws

and problems in,

and it's actually still quite costly in terms of resources

resource consumption and running the,

classical,

change data capture

of a pretty cold database, which change, like, once in a while,

is still extremely expensive.

And, this is still a a big problem because sometimes and I would say most of the times, the open source community is mostly thinking

about growing in a in a in a horizontal space, like improving amount of connectors. And I think it's a good focus in terms of Arrabight because they actually attract a lot of people

by saying we have all connectors that you want, and it's actually kind of true.

I I don't remember how exact amount of connectors they support right now. But 1 particular connector can be improved as well. And sometimes you need to invest a lot of times in these connectors. And some connectors should be extremely

tuned fine tuned to the to the extreme because

if you were reading data from API, you don't care much

care a lot about the performance because API not provide, like, terabytes of data. But if you

are reading the data from a s 3 bucket and try to move them in, like, for example, Cassandra,

you have to think about performance because in the s 3 bucket, you usually have a terabytes or maybe petabytes of data. And moving this petabyte of data with

code that's somewhere in between serialized and deserialized data twice with a Python code is extremely inefficient.

And you think you still need to think about this,

if you want to scale vertically on 1 particular pair of, of database. Yeah.

But, overall, current state of the world is quite amazing.

I do like a lot, RBITE as a concept, as a product, and the overall

overall do, like, a big fan of RBITE, to be honest.

With the work that you've done in the space of managing data transfer, trying to productionize

and productize it,

What are some of the lessons that you have learned in the process about,

how best to design systems to be conducive to data transfer or how to architect analytical pipelines to maybe minimize the need to actually move data. I'm just curious. What are some of the learnings that you've had from this experience that have influenced the way that you think about how to do data engineering overall?

I would say the main lesson I learned is, like, keep it simple. I have on my past experience a lot of over engineering data pipelines

with

a thousands, not even, like, maybe even tens of thousands,

steps in

a graph, like, out of flow graph with tens of 1, 000 steps in it just to,

build a table. I like the

directions that right now all data movements

all data engineering is moving. It's

leaning in towards the simplification

of everything.

I do like the shift that made, like, couple years ago from ETL to ELT,

and I would say,

main lessons that,

should learn by data engineers from all of the sites, like,

try to be as simple as possible,

and making data transformation,

is

exceptionally

hard,

and try to avoid them as much as possible

and try to, delay them as much as much as possible. But by by by this, I mean that don't try to build complex pipeline of transformation. Usually,

it's

very fragile constructions.

Sometimes, it's necessary for sure. But if you can avoid this, it's better to be avoided. And,

modern approach of transforming data on the target side, like,

DBT or something like something similar,

I think is a superior

if you have enough computing resources.

I do understand that modern approach with this, dbt stuff with the Snowflake and Databricks sometimes cost a lot, but not every database has a Snowflake.

And, also, 1 another interesting

topic,

I would suggest to try as much as possible, build everything on top of open source stuff rather than building around the vendors

and proprietary solutions

Because vendors,

even if it costs not a lot right now, it may cost a lot in the future when they understand that you're already on a hook

and you can't avoid them.

Vendor lock is,

extremely,

harmful in a data space,

And I think right now, a lot of harm was done by

the overuse of a Snowflake on on a lot of cases. It's not needed, to be honest. Snowflake is extremely capable solution, but it's

shouldn't be a default 1 because

you have you can do done a lot of stuff just in the Postgres.

If you can fit your data in the in the Postgres, use the Postgres. Another interesting

insight from my side.

Because,

I'm internally a big fan of also Postgres' screen, and,

I think that anything below the 1 terabyte of data should live in Postgres and not leave it, like, just being just with Postgres. It's enough for you.

My, marketing manager would,

agree about this because we're trying to provide ClickHouse as a service. But in a lot of cases, ClickHouse is a perfect database, but not in any case every cases.

And,

another thing that I would suggest is, like, there is no silver silver bullet. Snowflake is not silver bullet. It's a golden bullet. It's cost a little bit more. But

be ready to be flexible in your tools, choice of tools, and not, try to keep them simple.

But, again, don't stick to them too much.

The open source solution's usually easier to, abandon,

and, having this ability to abandon your solution, if you understand that it's not work well for you, it's important.

Just give you a little bit of freedom of of choice, like, l some some gap here.

Try to build this, your your pipelines, your

systems, your

data ingestion

workflow

as as abundant as possible. So you can just, like, delete it and create once again

on different,

rails with a different technology stack.

What else I can suggest?

Don't overuse Python.

And

in your experience of working in this space, figuring out how to build these solutions, working with customers? What are some of the most interesting or innovative or unexpected ways that you've seen this challenge of data transfer addressed?

Oh, I have a perfect example. A perfect example.

So, I have a customers,

which is quite interesting 1. They design against self driving cars.

Self driving cars,

is the

training model, basically.

So there is a car, a lot of fancy stuff on top of this with the leaders, with

cameras, etcetera. And inside of it is just a big computer

with robotic OS

that collect all this information and somehow to analyze it. And,

they have a challengeable challenge task.

Each machine, each car, actually,

like, machine car because,

inside of it, it's just a computer.

Each car goes around the city, collects this, telemetry information,

and somehow they need to analyze it centrally

on the big cluster.

And they need to move this data from machine,

on the ground from a car to the cloud somehow to analyze them later by machine learning guys or data engineers.

And, how to do this? Because usually machine

on the cities,

has a very limited capabilities of Internet connectivity. You have, like, a 3 g, maybe 5 g in in a in a good weather,

capability.

Cell connectivity. Nothing else.

So,

they what they do, they split data into 2 pieces. Like, important 1, a pre aggregated 1 is collected on the on the on the car itself and send it via,

mobile cell to the cloud somehow

by simple ish typical. It was not not so complicated. So they have a challenge

that they need to aggregate somehow information inside of a car before the sending this. So what they do, they install a click house on the machine, on the on the car itself, and start querying the click house to do aggregates of a data that collected by this right

and sending, to the cloud. But then they need to,

analyze this information at home

with these rough information, not aggregated 1. But,

how to achieve this? So they park machine in a in a garage,

and,

what they did, it's,

they connect machine to the proper network, to the LAN network.

And,

once the machine become in the LAN network, they connect the click house the node of a click house that run inside of a machine to the cluster of a click house, and they start to replicate the data.

And once machine,

in the car in a in a garage,

they replicate data. Once that is replicated, they go again for right.

And,

this was, like, the

strangest cluster that I saw in my life because each machine is just a node in a cluster.

And once the machine come back to garages, the node become a synchronized 1. And once machine in, in a in a in a city, it's just detached node. It's, like, go offline by some reason. And it's not just because,

outage. It's because they exploring the world.

And,

it was very

interesting way of, solving this problem because, initially, they tried to build it very complex, like, graph based

Python scripts

with the Pyflow in it,

but it was so,

clunky and

not robust 1. They always failed. Like, machine come to the garage. The graph has failed

because of some bug in the scripts, and this was hard to, implement and they don't want to work on this particular area. They

wanna focus on,

machine itself, on sensors, on machine learning, and all interesting stuff. And this techno technical stuff was not interesting to them.

And they, like, come to this interesting approach of using a machine

as a

as a node in a cluster.

The machine itself, like car, actual car. It's a wheel in it.

Yeah. It's definitely a very cool and different approach to that problem.

And

you've definitely built a very interesting product, very capable product, but for people who are trying to figure out how do I want to get my data from point a to point b, what are the cases where the double cloud data transfer product is the wrong choice?

That's actually interesting. I would say that if you have,

data spurs in a lot of different systems,

like,

API stuff,

I would suggest better to use a vanilla Ironbite 1,

because

we need to initially

internally try to minimize

amount of,

support work that come to us and therefore not open every Aramite connectors that have. We just reselect,

several

most prominent and

stable 1. And,

so if you have, like, a huge

different sources like

Twilio API,

some Zendesk API, or and and and so on and so on. We have dozens of such.

I would suggest you to just use Erbyte,

just

because it's easier to understand

this,

without

the middlemen. Because in in this case,

these API sources,

first of all, they will be hidden because we don't show the all API sources from everybody. And, second of all, even even exposed, we will be a middleman here.

And,

we're more focused on

a classical

movement from a more database

style stuff rather than,

API stuff. So if you have just an API sources,

just use API,

RBITE here.

It would be better for you to evolve.

And

as you continue to

build out the double cloud data transfer product, invest further in its evolution, add new capabilities? What are some of the things you have planned for the near to medium term or any particular projects or problem areas you're excited to explore?

Well, we have a lot of in mind.

1 of, approach that we actually investigating

right now, it's not yet defined, but we're thinking about this quite heavily. It's open sourcing this stuff,

open sourcing our data plane to,

make this publicly available because code itself it's pretty

agnostic. It can be reused by a community itself and actually get

some love from a community and some bug fixes, to be honest, and some bug reports, to be honest, as well. Another thing that we try to invest a lot is improving in terms of,

usability and debugability,

allowing more information,

about what happens inside, what can be goes wrong, how to can you can fix it. Because,

usually, a lot of stuff can happens can go wrong here,

and errors need to be fixed,

mostly on a on a source or target site by actual customers. Like, password isn't correct, network connectivity is not not set up correctly, etcetera.

And we try to improve our debug ability here. And another big,

challenge of our improvements, what we want to do is adding the transformation

capabilities.

We right now introduce in a in

a in a memory transformation,

with SQL queries. We use a lot of, like, ClickHouse in internally as a as a internal

engine.

And,

it's quite powerful tool, and we try to improve this area as well.

And what

another focus that we have,

is adding more native connectors. That connectors that you read by ourself that not is inherited by Arabyte.

That,

designed

specifically to transfer

petabytes of data. And 1 of these connectors like, Iceberg, Delta Lake, and all the history based connectors, we already have pretty good compatibility

with them, and we just want to improve this,

because

Ira byte itself is good with a small amount of data here.

But,

with a petabyte scale, it's not it's not gonna work. And, we wanted to invest,

some,

our

expertise

to provide ability to, move,

huge extremely huge,

delta table or isenberg tables from point a to point b. Because, usually, if you have this kind of tables, you have a lot of data. And it's not, like, gigabytes. It's not even terabytes. Sometimes it's even a petabyte.

And, we want to move them as as fast as possible.

Yeah. That's actually 3 our main focuses,

right now. It's, like, capability,

native connectors for extreme scale, and, possible open source. Yeah. Are there any other aspects of the overall challenges of data transfer and the work that you have been doing on the double cloud product that we didn't discuss yet that you'd like to cover before we close out the show?

I would say,

interesting aspect is running something

in a cloud,

especially once you strive start to,

run this in isolation.

I would say,

sometimes,

data engineers and data engineering

people,

doesn't

give enough attention to the network isolation,

and network isolation is hard.

A lot of tools nowadays not provide you proper,

network isolation. They run somewhere in the cloud, and you need to expose your data sources

to the Internet.

And I hear a lot of complaints from security people who actually,

run this, stuff,

that, exposing something to the Internet is a bad idea. And,

once we implement the transfer in a in a AWS

and GCP runtimes, we spent a lot of a heck of a lot of time

to,

provide the transfer and at a scale and a cheap cost

and inside of network isolation, inside of your private nets. And,

networking

is hard

even for the engineers that you,

specialized in networking.

But, data engineers still need to understand the network some some sort of. You still need to know that your database is not just just exposed somewhere. If you are able to access your database from your local laptop, it's not means that everyone can do so. Most probably, you have a VPN somewhere.

And,

connecting 2 points in the in your data

land landmark is,

can be hard if these 2 points in a different VPC or different cloud providers.

Connecting cloud providers itself, it's quite a challengeable task.

And,

making this compliant is even more challenging task.

Yeah. So,

in this if if you are a data engineer nowadays, just invest a little of your time to understand what is the private nets, what is your networking

stack of your current company,

and, and all this stuff because networking is hard, and it still need to be managed. It still need to be handled by somehow some someone. And if it's handled not by data guy, he will just, like, forbid every access to you. And,

it it it would be painful to enable this back again.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

It's a hard question.

I would say the biggest,

missing biggest gap

right now is,

I I would say out of box cloud native solutions that you can host on yourself.

The closest to 1 to to this area is Arbeit, but it's not, like, really cloud native.

It's run everything on the same node on the same cluster,

but running any anything

something on a scale require you

running this

on infrastructures

a little level, like, a bit lower level, and we don't have anything right right

right now. Also, another thing that is a bit in a gray area right now is a data government tool. They evolved a lot in the past years, but data government is actually exceptionally hard

because

quality of data is important.

And, if you just move data around, it's not enough for your business needs. You need to not just put this data on a target database. You need, actually

put some effort,

clean up it, make it more quality 1 for later consumption. And data quality tools, data governance tools is,

extremely important. I would I I see some tractions in this area for the past couple years. There is a couple good projects,

that's

popped out in this area. I don't remember the names,

but

are they mostly the the the 1 of the biggest actually wins in this area that actually give a lot of flexibility is DBT tests. The DBT test is actually a good starting point for you to make a data governance. Writing tests on your data is actually important as well if you wanted to build a robust solution.

Another interesting

area that is evolving right now, it's not so problematic right now because we have good,

solution on the market in open source world,

but data cataloging calls are important because,

seeing how does a data goes from point a to point b,

it's extremely important on a complex system. If you have big system,

it's really hard to track how this field was calculated in the first place. Where is the root cause?

How what what was involved

and what actually may broke? And if you broke something in between, what actually is affect how how big impact of this failure?

And,

there is 1 particular,

particular, set of tools that help you. It's a data catalogs. I think right now most popular 1 is open metadata. So,

this is this gap is more or less filled.

But I think adoption of an open metadata and the data cataloging itself is not that high right now. So maybe this is would be evolved in a couple

next years.

The data catalog

is not that hard to adopt,

and it actually

enables

data

inside of a company by a lot because they expose your data not just for the for data engineers but, actually for business people.

Also, they can actually explore what actually you have for later analysis.

So this is important.

And I think adoption of a cataloging

also is a is

a is a a gap, not like tooling, but the adoption of the solution. Absolutely. Yeah. That's a space that I'm looking to invest in next myself, so I appreciate you calling that out.

Well, thank you very much for taking the time today to join me and share the work that you are doing on the data transfer product at DoubleCloud and your overall experience working in the space and building solutions to address this very necessary problem.

Appreciate all the time and energy that you've putting into that, and I hope you enjoy the rest of your day.

Thank you. Thank you.

It was a pleasure to meet you as well.

Thank you for listening. Don't forget to check out our other shows, podcast dot in it, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning podcast,

which helps you go from idea to production with machine learning.

Visit the site at dataengineeringpodcast.com.

Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts at data engineering podcast.com

with your story.

And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links