Summary
Spark is a powerful, battle-tested framework for building highly scalable data pipelines. Because of its proven ability to handle large volumes of data, Capital One has invested in it for their business needs. In this episode Gokul Prabagaren shares how he uses it to calculate credit card rewards points, including the auditing requirements involved and how he designed his pipeline to maintain all of the necessary information through a pattern of data enrichment.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m interviewing Gokul Prabagaren about how he is using Spark for real-world workflows at Capital One
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the types of data and workflows that you are responsible for at Capital One?
- In terms of the three "V"s (Volume, Variety, Velocity), what is the magnitude of the data that you are working with?
- What are some of the business and regulatory requirements that have to be factored into the solutions that you design?
- Who are the consumers of the data assets that you are producing?
- Can you describe the technical elements of the platform that you use for managing your data pipelines?
- What are the various ways that you are using Spark at Capital One?
- You wrote a post and presented at the Databricks conference about your experience moving from a data filtering to a data enrichment pattern for segmenting transactions. Can you give some context as to the use case and what your design process was for the initial implementation?
- What were the shortcomings of that approach, or the business requirements, which led you to refactor it into one that maintained all of the data through the different processing stages?
- What are some of the impacts on data volumes and processing latencies working with enriched data frames persisted between task steps?
- What are some of the other optimizations or improvements that you have made to that pipeline since you wrote the post?
- What are some of the limitations of Spark that you have experienced during your work at Capital One?
- How have you worked around them?
- What are the most interesting, innovative, or unexpected ways that you have seen Spark used at Capital One?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data engineering at Capital One?
- What are some of the upcoming projects that you are focused on/excited for?
- How has your experience with the filtering vs. enrichment approach influenced your thinking on other projects that you work on?
Contact Info
- @gocool_p on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a-t-l-a-n, and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l-i-n-o-d-e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Gokul Prabagaren about how he is using Spark for real-world workflows at Capital One and some of the patterns that he's developed to help in his work. So, Gokul, can you start by introducing yourself? Thanks for, first of all, having me, Tobias, and
[00:02:12] Unknown:
I'm really excited to share what we have done. Before we start, I'm Gokul Prabagaren. Currently, I'm an engineering manager at Capital One. My primary responsibility is managing the platform and the application which is responsible for processing customers' credit card transactions and computing the earn for those transactions. In simple terms, if you have bought coffee or anything else using a Capital One credit card, my application is responsible for computing the earn for whatever spend you have done. It is managed by a team, and I am one of the engineering managers for the application which does that earn computation.
[00:02:59] Unknown:
Do you remember how you first got involved in the area of data management?
[00:03:02] Unknown:
Currently, with Capital One, data is pretty much in Capital One's DNA, given the volume that we currently manage. Compared to that, from my old days when I started, in the initial phase of my career, I was always associated with data. In my previous role it was not at the same scale as what I'm dealing with now. It was for AT&T's back-end router provisioning application, and it was pretty much internal to them. What I was primarily dealing with was some sort of ITIL process which exposes all the network provisioning orders which have gone through.
After a certain time, it was more about router configuration, and it was all dealing with data: say, Juniper or Cisco router configurations, and how it is all depicted in a logical system compared to the physical, real network. That was what I was previously doing: the application which has to process that data and figure out the discrepancies between these physical and logical systems, so that when someone is trying to provision, it validates whether both are in sync. So I have always been associated with data. But I really started looking into different ways of doing those processes when I got into big data computing, and at the same time things were also progressing on big data and distributed computing.
Apache Spark was coming along. Prior to that, we had Hadoop. So all these things were converging at one point, and that's when I really jumped into this big data ocean. From then on, I started doing large things for Capital One. It's all mostly data intensive, given the volume of data we manage. It may not be at terabyte scale; yes, we do manage those things, but not on a daily basis. So
[00:05:09] Unknown:
pretty much I have done a lot of things from my initial days on data, which I continue to do even now at Capital One as well. That brings us to the work that you're doing today. And I'm wondering if you can just start by giving a bit of an overview of some of the types of data that you're dealing with and some of the workflows that you're responsible for in the work that you're doing at Capital One.
[00:05:29] Unknown:
So on a very high level, as I was saying, we are the system which is responsible for customer credit card transaction processing, which pretty much utilizes Apache Spark for most of its processing, and there are some orchestration pieces. If you take a very high-level, 10,000-foot view, it's more like we receive customer credit card transactions, there is a big computation which happens, and then we process those things. That's a very high-level workflow. On the technology side, what we deal with is mostly Apache Spark, and there are a lot of back-end databases, those kinds of things. That's what our workflow is, at a very high level. The primary business case is computing earn and surfacing that information for customers in different ways.
And when it comes to the type of data we deal with, it's pretty much the specific data format in which we receive transactions; that's somewhat specific. But beyond that, we deal with a lot of JSON and we process a lot of information. The enriching or filtering data also gets more into the Parquet side of it. Those are the high-level data types we deal with, and the workflow, primarily, is we get the transactions, compute, persist. Those are the very high-level blocks, at least in my application currently. In terms of the
[00:07:00] Unknown:
kind of volume of data, you mentioned that you're dealing with a lot of Parquet files, so I'm assuming that you're primarily sourcing and depositing information in S3. I'm wondering what you are looking at in terms of the three Vs of volume, variety, and velocity, as to the magnitude of each of those dimensions and some of the ways that that magnitude might complicate the work that you're doing or add some additional detail-oriented necessity to how you think about approaching those workflows?
[00:07:28] Unknown:
So if we get into the three Vs, volume-wise it ranges from probably 25 to 50 million transactions daily; that's what we typically process. And then that volume goes through a variety of workflows. It ranges from a simple use case, if you're familiar with Capital One cards, where a customer swipes using one of our cash-back cards, Quicksilver, and earns 1.5% back on whatever they spend, all the way to a customer getting a specific bonus depending on their portfolio. The variety ranges across various cases, and there are anniversary bonuses, those kinds of business cases. And when it comes to velocity, we do have a mix of things, but for this use case it happens per day. That's the way we put things together, and that volume ranges from 25 to 50 million.
[00:08:41] Unknown:
And sometimes it even exceeds that, but for our conversation we can say 25 to 50 million is the processing volume we actually deal with. And because of the fact that you're doing all of this data processing within the boundaries of a banking organization, which is subject to a lot of financial regulations and scrutiny, I'm wondering what are some of the business requirements and regulatory requirements that factor into the types of operations that you're doing, how you need to think about auditability of those workflows, the level of accuracy that's necessary, and some of the ways that you need to verify and validate the accuracy of the computations that you're performing?
[00:09:21] Unknown:
To answer that, we operate in a highly regulated environment, and we need to account for a lot of auditing documents. We, being a source of truth, have an obligation to produce those documents. A simple way to put it is: are we always doing what we said we will do? It's a simple one-liner, but all of our decisions have to be viewed through this lens. And the example which we are about to discuss is also a primary reason for implementing that; the enriching pattern was an outcome of this kind of need which arose, because in some cases we have to answer what happened. A simple case is a customer can call and say, hey, why didn't I get this earn for this particular spend? And we are answerable to them. Beyond that, we are answerable to regulators.
In a lot of cases, at this scale of volume, we need to see what has happened to a particular transaction. That is one of the simple things we have. So a lot of the designs we actually tackle are also mainly related to those kinds of things. And as I said, we process 25 to 50 million transactions, and we need to see what has happened to each one transaction in case some question arises in various forms, like why that particular transaction hasn't earned. So we are answerable to all of these things, and most of our design always accounts for how well we can answer all of these kinds of needs.
Enriching is one of those things, and there are various related things we tackle in our whole pipeline. Those are the things which indirectly produce the documents which I was mentioning.
[00:11:29] Unknown:
- So the data assets that you're producing, I'm wondering who are the downstream consumers, whether it's business end users, other application teams, or data scientists, and how do those various consumers influence the way that you think about the format that you need to generate your outputs in?
[00:11:48] Unknown:
It's a combination of both. We, being the source of truth, are actually responsible for producing the data, which is mainly needed across the enterprise. So I'll group them all as our internal customers. The information we produce is also used by a lot of our DAs. They all fall into the internal category, and there are various ones depending on their use cases, all those kinds of things. Because we produce this rich data, there is a need for it within the enterprise; they are all internal clients.
And coming to the external side, the simple one is the customer getting their statements, which is also somewhat internal. But in a similar way, let's take the Costco coupon, which most people here can probably associate with, which goes out every year to the customer depending on whatever they earned in the previous year. If you see that use case, customers are actually getting rewards in that coupon. We, being the system of truth, do have similar kinds of things in our processing, and we do need to surface that information to our external partners as well. That actually gets into their systems, and that's how this flow also works. So we do have a mix of both.
We have internal customers and external customers. Internally, there are various use cases, because this data goes into all kinds of warehousing needs and some specific transactional needs; we do have all of these use cases internally as well. Externally, I was just trying to give the Costco coupon example; we do have similar sorts of things for external clients as well. So the data we produce is actually utilized by both.
[00:13:49] Unknown:
In terms of the technical elements, I'm wondering if you can just talk through some of the system architecture and system design, some of the constraints that you're dealing with as far as how to think about designing the jobs that you're working with, and the ways that you're leaning on the tools that are available to you?
[00:14:09] Unknown:
So our data pipeline, as I was previously mentioning, uses Apache Spark as its processing framework. And we do have a mix of server and serverless options in our pipeline; depending on the use cases, both are in play. And we use Apache Spark in various ways. At least in the application which I'm responsible for, we do have both use cases, where batch and real time are being used. But across the enterprise, we do have batch, real-time, and ML-related use cases. When it comes to the constraints, I think over time there have been things which we have had to specifically design for our needs.
[00:14:58] Unknown:
You mentioned that you're leaning heavily on Spark. And I know that you wrote a post a little while ago and presented at the Databricks conference to talk about some of the patterns that you started with and evolved to handle some of the specific requirements of your processing workflow and the regulatory capabilities that are necessary to be able to support some of that auditing. But before we get into that, I'm wondering if you can just talk to some of the other ways that Spark is being used throughout Capital One and how you're able to maybe lean on some of the established patterns in the existing infrastructure and the aggregate data processing knowledge of the organization?
[00:15:38] Unknown:
Across the enterprise, we do have various use cases. We use Spark, Spark being the industry leader in distributed computing, and it is used heavily across our enterprise. For a simple batch workload, we consume some file; there are various ETL-related jobs; and there is even transactional processing like ours, heavily transaction-based processing, which also leverages batch Spark. There is some real-time and hybrid processing as well that I have seen, which is more like a mix of both batch and real-time workloads, solving a real business use case where a customer tips and then they immediately get a notification. That is actually a hybrid implementation across the enterprise. There are even some specific ML-related things, which I am not very close to, but I have seen that those are also being used. So we have a variety of use cases within Capital One; we use Spark heavily.
[00:16:43] Unknown:
In terms of your specific workflow, you mentioned that you're dealing with processing some of the purchase information to calculate the rewards that are going to be available to a customer. As opposed to a fraud situation, I'm sure that there are different requirements in terms of the latency that's acceptable for end-to-end processing. And I'm wondering if there are prioritizations or different Spark clusters that are available to make sure that the fraud analysis doesn't get held up behind the rewards computation, or any sort of priority scheduling across the different Spark infrastructure that's available, and how that manifests in terms of what the processing capacity is at any given point? At least in our world, these two don't really collide, in the sense that
[00:17:30] Unknown:
the case I was explaining is a hybrid case; even if fraud is also involved in that, it doesn't really get into this use case itself. And we, being transactional processing, are not operating in real time for what we actually do. There is a mix of those things, but for our use case we don't really collide on sharing of resources, those kinds of things, because it's not something like a Kubernetes cluster where everything is heavily shared. Instead, it's more driven by each use case and how they manage their infrastructure: they bring up their compute and then they're done, those kinds of things, and that is how it doesn't collide, particularly in this use case. Whether they're testing on fraud, or computing using customer information, or immediately responding to some of the information from the authorization, those kinds of things all don't collide, basically.
[00:18:23] Unknown:
And so digging now into the specifics of your workflow and some of the learning that you did in the process of going from the initial implementation to where you are now, I'm wondering if you can just talk to some of the specific requirements that were in place as far as the information that you needed to be able to compute, the information that was necessary to maintain throughout those various stages of computation, and the initial design that you developed, and what were the priorities that you were focusing on when you landed on that initial structure? Maybe I'll
[00:19:00] Unknown:
just dig in more and give a high-level description of what the use case is; then that will help us trace through this. So the high-level use case, as I said, is we process the customer transactions, and there are multiple eligibility things we need to ensure before they really get into the earn computation. Something like account eligibility: a simple one is whether this account is really eligible to get rewards. And there are various transaction eligibilities; say there is a balance transfer, which is also one of the things a customer does, but typically for those things the customer may not be earning rewards, so just to connect things, that also falls into the transaction eligibility case. Once we do all of these eligibility checks and see that, yes, the transaction is eligible,
they are eligible to get the earn. We will compute the earn on those and then persist to our data stores for various downstream use cases: a customer logging into their account through the web page or through mobile to see what transactions they had or how much in rewards they have, all those kinds of things, or a customer getting their statements, which carry the rewards information. This is a very high-level platform depiction. Our initial design was actually using a pattern called the filtering pattern in Apache Spark. That filtering pattern uses inner joins; in simple terms, two different datasets are joined for a particular need. In this example, let's say there are credit card transactions, and we are trying to see, out of those, whether any are ineligible based on account or transaction.
That's one of these joins. And if you use the filtering logic, which was our initial approach, to give you an example of how that filtering was done at scale, but diluted for a simple explanation: say we had 10 transactions when we started, but when we did the account eligibility check, only 5 transactions remained. Based on account eligibility, we then need to determine the transaction eligibility. When we try to determine the transaction eligibility of those 5, we got only 3, because we are filtering at each stage. So, finally, let's say we end up with 2 transactions which are really eligible to compute earn on, and we end up computing earn for those 2 transactions. This is what I mean by filtering, where at each stage of the computation we filter out transactions.
Which means, if you take the dataset as rows and columns, it actually shrinks in rows. When we started with 10 rows, we finally ended up with 2 rows of eligible transactions. That was our initial implementation, and we went live with it. And we immediately realized its major shortcoming: in retrospect, if we have to back-trace any one transaction, why it got filtered or at which stage it got filtered, people who are very familiar with Spark can say, hey, you can do a count at each stage, which is actually a costly operation when it comes to Spark. We can still do that, but it will only give us how many got filtered out at each stage.
Instead, if you want to get to the granularity needed for a regulatory or traceability need, you need to know not just how many got filtered out; we also need to know why they got filtered out. With the scale of our operation, say 50 million transactions processed, we need to be able to figure out, hey, what has happened to that particular transaction? That granularity is what was lacking in the filtering approach. We immediately found that out as a shortcoming, and then we switched over to a pattern which is best described as the enriching pattern. In the enriching pattern, the information actually grows column-wise. In the sense that, in the same example, we start with 10 transactions, and at each stage, instead of filtering out transactions, we keep enriching the dataset so that there is enough information to make a decision. In the same example, let's take 10 transactions.
When we try to see how many are really eligible accounts out of these, instead of filtering them out, we bring in the information and enrich the dataset. Same thing for transaction eligibility. Finally, when we try to see which are not really eligible using the information we gathered at each stage, in terms of rows it stays at 10, but in terms of columns it grows, because we have been enriching at each stage. So, finally, using all the enriched information, we make a determination: okay, we see that these are not eligible accounts, and these are not eligible transactions.
In the end result, we will end up with the same 2 transactions, but how we reach that state is what makes a difference, based on the use case. There may be some valid use cases where people want to compute pretty quickly in memory; in some cases, filtering helps. In cases like ours, where we need back-traceability and granular details, the enriching option fits better, because at any point in time you can go back to your Parquet information and see, hey, what has happened to that one particular transaction? Oh, this was the reason: at that point in time, this account was actually ineligible to get rewards. That is the reason they haven't got rewards. That is, at a very high level, how these two compare.
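To make the contrast concrete, here is a minimal PySpark sketch of the two patterns as described above. The paths, column names, and eligibility tables (account_id, is_account_eligible, and so on) are illustrative assumptions rather than Capital One's actual schema or code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("filter-vs-enrich").getOrCreate()

transactions      = spark.read.parquet("s3://bucket/transactions/")        # hypothetical paths
eligible_accounts = spark.read.parquet("s3://bucket/eligible_accounts/")
eligible_types    = spark.read.parquet("s3://bucket/eligible_txn_types/")

# Filtering pattern: inner joins shrink the row count at every stage, so the
# transactions that were dropped (and the reason they were dropped) are lost.
filtered = (transactions
            .join(eligible_accounts, "account_id", "inner")
            .join(eligible_types, "txn_type", "inner"))

# Enriching pattern: left joins keep every row and add eligibility flags as
# new columns, so each transaction carries its own audit trail.
acct_flags = eligible_accounts.select("account_id",
                                      F.lit(True).alias("is_account_eligible"))
type_flags = eligible_types.select("txn_type",
                                   F.lit(True).alias("is_txn_type_eligible"))

enriched = (transactions
            .join(acct_flags, "account_id", "left")
            .join(type_flags, "txn_type", "left")
            .fillna(False, ["is_account_eligible", "is_txn_type_eligible"]))

# Earn is only computed where every flag is true; ineligible rows remain in
# the dataset and stay queryable for back-tracing.
earn_candidates = enriched.filter("is_account_eligible AND is_txn_type_eligible")

# Persist the full enriched dataset (not just the eligible rows) for auditing.
enriched.write.mode("overwrite").parquet("s3://bucket/enriched_transactions/")
```

In this sketch the row count of the enriched DataFrame matches the input while its column count grows with each eligibility check, which is what makes the "why was this transaction excluded" question answerable later.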
[00:25:28] Unknown:
Given the fact that you need to persist this information at various stages, I'm wondering if you have looked at leaning on anything such as the Delta Lake format for being able to materialize this information as a set of tables, or any other sort of data warehouse style, maybe with history tables, to be able to maintain the different states of computation at those various stages. Or, if the auditability is only there for regulatory purposes, you don't necessarily need to keep it live and queryable in an easy-to-access format; you just need to be able to maintain that history and perhaps be able to compact it over time.
[00:26:08] Unknown:
For this particular use case, we haven't really persisted at each stage. What we have been doing is enriching, so the dataset grows column-wise. When it comes to performance, we haven't really seen any major degradation, even at the same scale I was explaining, 50 million transactions, because it's a similar dataset and only the delta of the computation changes: the columns grow, but the information stays the same. So persisting at each stage really hasn't been needed for us. But when you bring up Delta Lake and those kinds of things, those are very recent developments in Spark, so we are definitely evaluating them. In our case, though, this is actually transactional processing.
We do push this information out for other use cases; this information gets out, and behind the scenes there are some use cases which utilize it and take some reactive actions. That is also there. But particularly for this transactional use case, we haven't persisted at each stage, and we haven't seen any major performance degradation. This is one of the processes in the transactional pipeline, so we have just kept it as it is, because we haven't really seen major performance degradation.
[00:27:37] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt, and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. And as far as the performance and storage aspects, as you mentioned, with the filtering approach everything's done in memory, so it's all ephemeral and you don't have to worry about persisting this information. But because of the fact that you do need to have that audit trail, you want to see what the data was at every point, what decisions were made, and why, you need to be able to persist that information across these various computation stages. And I'm wondering, what are some of the impacts on the overall data volume produced by these jobs going from the filtering approach to the enrichment approach, some of the ways that latency is impacted as well, and maybe some of the data quality checks that you've put in to ensure that you don't end up with any sort of regressions as you evolve the job?
[00:29:19] Unknown:
There is definitely a delta increment to the whole computation, because, as you said, in the previous approach everything was computed in memory, so comparatively that is quick. With the enriching option, we are growing the dataset at the column level. But what we actually do is capture a lot of this information, and that information is used to make the determination at one stage, which is itself one of the stages. So there is only a slight delta increment in pure computation time. For this particular use case, we looked at what is actually a priority for us: can we live with that delta compared to the advantages it provides for our use case? So we are just living with that additional delta. But as I was saying, even at 50 million transactions, the delta is not too much, and the benefit it provides always outweighs that extra processing. So that's what we are using, and we are not seeing major latency, those kinds of things.
[00:30:29] Unknown:
As you have been working on this job and you went from the filtering to the enrichment approach, I'm wondering what are some of the other optimizations or improvements to this workflow that you're planning to build in, any sort of validation, data quality checking, or additional quality control that you're adding to ensure that you have the auditability, and maybe some of the tooling that you're building to ensure that similar jobs or similar requirements in the future are able to benefit from what you learned in the process of going through this one workflow?
[00:31:04] Unknown:
After we realized the benefits of this pattern of enriching the data, we adopted it as our standard pattern for any of our data pipelines going forward. The data pipeline we are serving is a fairly involved process: there are a lot of sub-pipelines, there is a lot of information and determination done at various stages, and these sub-pipelines contribute to the main pipeline. The initial assessment we did of the implementation was on one of the main pipelines, and for whatever sub-pipelines we have implemented since, we have adopted this pattern across the board. That is also one of the things which we use as a standard.
And coming to optimizations or improvements, there are various learnings we have had when it comes to this whole pipeline. I have actually recently published one of those: the learnings we had from having NoSQL back ends for Spark apps. There are a few things which we had to particularly tweak when it comes to having a NoSQL database compared to our old traditional data stores, which had transactions. I won't say that is completely gone; the industry is moving more towards NoSQL, and something like NewSQL. For all of those kinds of things we have made some tweaks, which I have already shared, a topic for another day. But there is some optimization we have actually done in the whole pipeline beyond adopting this as our standard pattern.
[00:32:51] Unknown:
In terms of the work that you're doing with Spark, what are some of the cases where you've run up against some of its limitations as far as what you're able to do with it, and some of the workarounds that you've built or some of the times that you've gone to other tools because Spark just wasn't the right fit for the requirements that you had? One thing which comes to mind immediately is that the data types, in certain cases, in our dataset
[00:33:17] Unknown:
have actually caused some problems for us when we are trying to join some information and do some computation. Certain data types have caused problems, like timestamps, and we do manage a lot of these monetary-related things. Those kinds of things are where I have seen some specific issues. I won't say it's a limitation; it's just not a trivial way to do those things. That is what, I may say, was a limitation we had when we were building this whole pipeline, but which we have actually overcome by adopting UDFs.
Spark comes with user-defined functions, which give you the flexibility to handle whatever specific way your data needs to be processed. You can package that as a particular function, and that function does things however you want. So we have adopted UDFs for those kinds of use cases.
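As a rough illustration of that approach, here is a hedged PySpark sketch of the kind of UDFs being described: one normalizing a source-specific timestamp string before a join, and one coercing monetary amounts to exact decimals. The format string, column names, and path are assumptions for illustration, not the actual pipeline's schema.

```python
from decimal import Decimal, ROUND_HALF_UP
from datetime import datetime

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import TimestampType, DecimalType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

@F.udf(returnType=TimestampType())
def parse_txn_timestamp(raw: str):
    # Tolerate a non-standard timestamp format from the source system
    # (assumed format; adjust to the real feed).
    return datetime.strptime(raw, "%Y%m%d%H%M%S") if raw else None

@F.udf(returnType=DecimalType(12, 2))
def to_money(raw: str):
    # Keep monetary amounts as exact decimals rather than floats.
    return Decimal(raw).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP) if raw else None

df = spark.read.json("s3://bucket/raw_transactions/")   # hypothetical path
df = (df.withColumn("txn_ts", parse_txn_timestamp("txn_ts_raw"))
        .withColumn("amount", to_money("amount_raw")))
```

Once the columns are normalized this way, joins and aggregations on them behave predictably, which is the gap the UDFs are closing here.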
[00:34:16] Unknown:
As far as the other ways that you've seen Spark used, or other ways that you've used Spark while at Capital One or maybe in other jobs as well, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:34:28] Unknown:
We do have various use cases, and we are definitely a Spark-heavy shop, spanning batch, real time, and machine learning, with a lot of tools. One which is closest to me, and which I feel is much more interesting, let's say innovative, is the combination of both batch and real time to solve a particular business case and give the customer a real benefit. It's the one I mentioned before: alerting a customer when they have tipped high, or if we see the same transaction being authorized with similar attributes, we immediately alert the customer saying, hey, is it a double swipe?
Are you really doing a second swipe? Those kinds of cases are actually provided as functionality to Capital One customers, and in the back end that does use this hybrid model, where some information comes from batch computation and there is definitely a real-time need, which is also done using Spark Streaming, so the customer gets alerted using the information from both kinds of processing. So that is one which is close to me, and I felt like that is an interesting solution for a customer need.
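For readers unfamiliar with that hybrid shape, here is a simplified, hedged Structured Streaming sketch of what such a double-swipe check could look like. The Kafka topic, schema, window sizes, and the idea of joining batch-computed alert preferences into the stream are illustrative assumptions, not the production design.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("double-swipe-alerts").getOrCreate()

# Live authorization events (hypothetical topic/schema; requires the
# spark-sql-kafka package on the classpath).
auths = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "card-authorizations")
         .load()
         .select(F.from_json(F.col("value").cast("string"),
                             "card_id STRING, merchant_id STRING, "
                             "amount DECIMAL(12,2), event_time TIMESTAMP").alias("a"))
         .select("a.*"))

# Batch-computed reference data joined into the stream is the "hybrid" part:
# the output of a batch job feeding a streaming job (hypothetical path).
prefs = spark.read.parquet("s3://bucket/alert_preferences/")
auths = auths.join(prefs, "card_id", "left")

# Authorizations with the same card, merchant, and amount inside a short
# window suggest a possible double swipe.
possible_dupes = (auths
                  .withWatermark("event_time", "10 minutes")
                  .groupBy(F.window("event_time", "5 minutes"),
                           "card_id", "merchant_id", "amount")
                  .count()
                  .filter(F.col("count") > 1))

# In production this would feed an alerting sink; console is a stand-in.
query = (possible_dupes.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```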
[00:35:57] Unknown:
And in your own work of building on top of Spark and doing data engineering at Capital One, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:36:07] Unknown:
Challenging, I would say, is that when it comes to this whole distributed processing, running a transactional system which leverages Spark as its framework with a NoSQL back end requires a lot of unique customizations and designs to operate in a regulated environment where we are dealing with customers' real money. Managing those things holistically is somewhat challenging on the operational side, because Spark is a distributed computation framework, and NoSQL, on the other side, compared to our traditional data stores, is also known for scale, because the information is also distributed.
In some cases these two go very well together; when it comes to Spark and Cassandra, they're known for processing things at the partition level very well. But when it comes to Spark and Mongo, how you leverage the information and the libraries they provide to reap the maximum benefit for a transactional store is somewhat challenging, and it requires unique design patterns to have it be a resilient system.
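As context for that point, here is a hedged sketch of a Spark-to-Cassandra write using the DataStax Spark Cassandra Connector, where the partition-level cooperation he mentions comes from the connector grouping writes by partition key. The keyspace, table, host, and path are placeholders, and the connector itself has to be on the classpath (for example via --packages).

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("earn-writer")
         .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder host
         .getOrCreate())

# Hypothetical enriched output from an earlier pipeline stage.
earn = spark.read.parquet("s3://bucket/enriched_transactions/")

# The connector batches rows by Cassandra partition key, which is where Spark
# and Cassandra cooperate well; idempotent, transaction-like semantics on top
# of this are the kind of custom design work described above.
(earn.write
     .format("org.apache.spark.sql.cassandra")
     .options(table="transaction_earn", keyspace="rewards")
     .mode("append")
     .save())
```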
[00:37:25] Unknown:
As far as some of the upcoming projects that you're working on at Capital One, what are some of the types of challenges that you're excited to dig into, or some of the additional strategies that you are planning to try out, or lessons that you're looking to learn as you evolve the processing infrastructure for the specific data domain that you're responsible for?
[00:37:48] Unknown:
I have been using Spark from its very early days, probably well before it reached version 1. And I have been following the Kubernetes space for some time as well. So I do see that two of these massive distributed processing technologies, one a framework and one a platform, are converging, and with Spark's recent versions, I think 3.x, they even started natively supporting this. So the convergence of Kubernetes and Spark is one area where I am definitely interested, because moving from VM-based processing into container-based processing, using Kubernetes to manage all of those things, is something I'm really excited about. I want to see all of our workloads get into this pattern, because that's where the industry is moving.
So that convergence actually has a lot of benefits, and we can solve the right kinds of use cases with it. That's one thing I'm really curious about: the convergence of these two.
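As a hedged sketch of what that convergence looks like in practice, the snippet below points a Spark session at a Kubernetes API server so that executors run as pods instead of on VMs. The API server address, container image, namespace, and executor count are placeholders, and a real deployment (typically cluster mode via spark-submit) involves more configuration such as service accounts and driver networking.

```python
from pyspark.sql import SparkSession

# Client-mode example of Spark natively scheduling its executors on Kubernetes.
spark = (SparkSession.builder
         .appName("earn-on-k8s")
         .master("k8s://https://kubernetes.example.com:6443")                   # placeholder API server
         .config("spark.kubernetes.container.image", "example/spark-py:3.3.0")  # placeholder image
         .config("spark.kubernetes.namespace", "data-pipelines")                # placeholder namespace
         .config("spark.executor.instances", "4")
         .getOrCreate())

# Trivial job just to exercise the executor pods.
print(spark.range(1_000_000).count())
```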
[00:38:55] Unknown:
Are there any other aspects of the work that you're doing at Capital One and the ways that you're applying Spark to data management for these financial transactions that we didn't discuss yet that you'd like to cover before we close out the show? Well, I think the main idea is that we followed a particular pattern and we switched over to a different pattern for our use case. My goal was to socialize this so that people can
[00:39:19] Unknown:
make a somewhat informed decision for their own use case. For some use cases it may be filtering; for some other use cases it may be enrichment. Along with that, as I was saying, there are a lot of learnings we have had in this whole process, and I have shared them in different forms. Those things will probably benefit the audience. I feel like those are the things I actually wanted to cover.
[00:39:43] Unknown:
Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:39:57] Unknown:
People can reach out to me on Twitter; my handle is gocool_p. Or if they prefer LinkedIn, it's gokulp, one word. So they can reach out to me on Twitter or LinkedIn. Answering the final question, one thing which I feel could be enhanced throughout the process is developer tooling. Because Spark is a distributed computation framework, when we are dealing with a lot of this distributed processing, developer tooling where they can do a lot of these kinds of things on their own systems would help. I know Docker does some of this, and we can get all of these things onto Docker as well.
A somewhat easier way of handling these things on their own systems would probably help developer productivity. That's one thing we can probably improve upon at this stage.
[00:40:55] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Capital One and some of the lessons that you've learned as far as the different processing patterns and the benefits and trade-offs that they provide. I appreciate the time and insight that you've been able to share, and I hope you enjoy the rest of your day. Thank you, Tobias. It was a pleasure, and thank you for the opportunity. You also have a wonderful day.
[00:41:23] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Introduction and Sponsor Messages
Interview with Gokul Prabagaren
Gokul's Background and Role at Capital One
Overview of Data Workflows at Capital One
Regulatory and Business Requirements
System Architecture and Design
Initial Implementation and Challenges
Enrichment Pattern and Its Benefits
Optimizations and Future Improvements
Limitations and Workarounds with Spark
Interesting Use Cases and Lessons Learned
Future Projects and Industry Trends
Closing Remarks and Contact Information