Summary
Spark is a powerful, battle-tested framework for building highly scalable data pipelines. Because of its proven ability to handle large volumes of data, Capital One has invested in it for their business needs. In this episode Gokul Prabagaren shares how he uses it to calculate credit card rewards points, including the auditing requirements involved and how he designed his pipeline to maintain all of the necessary information through a pattern of data enrichment.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m interviewing Gokul Prabagaren about how he is using Spark for real-world workflows at Capital One
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the types of data and workflows that you are responsible for at Capital One?
- In terms of the three "V"s (Volume, Variety, Velocity), what is the magnitude of the data that you are working with?
- What are some of the business and regulatory requirements that have to be factored into the solutions that you design?
- Who are the consumers of the data assets that you are producing?
- Can you describe the technical elements of the platform that you use for managing your data pipelines?
- What are the various ways that you are using Spark at Capital One?
- You wrote a post and presented at the Databricks conference about your experience moving from a data filtering to a data enrichment pattern for segmenting transactions. Can you give some context as to the use case and what your design process was for the initial implementation?
- What were the shortcomings of that approach, or the business requirements, which led you to refactor it into one that maintained all of the data through the different processing stages?
- What are some of the impacts on data volumes and processing latencies working with enriched data frames persisted between task steps?
- What are some of the other optimizations or improvements that you have made to that pipeline since you wrote the post?
- What are some of the limitations of Spark that you have experienced during your work at Capital One?
- How have you worked around them?
- What are the most interesting, innovative, or unexpected ways that you have seen Spark used at Capital One?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data engineering at Capital One?
- What are some of the upcoming projects that you are focused on/excited for?
- How has your experience with the filtering vs. enrichment approach influenced your thinking on other projects that you work on?
Contact Info
- @gocool_p on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a-t-l-a-n, and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l-i-n-o-d-e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Gokul Prabagaren about how he is using Spark for real-world workflows at Capital One and some of the patterns that he's developed to help in his work. So, Gokul, can you start by introducing yourself? Thanks for, first of all, having me, Tobias, and
[00:02:12] Unknown:
I'm really excited to share what we have done. Before we start, I'm Gokul Prabagaren. Currently, I'm an engineering manager at Capital One. My primary responsibility is managing the platform and the application which is responsible for processing customers' credit card transactions and computing the earn for those transactions. In simple terms, if you have bought coffee or anything else using a Capital One credit card, my application is responsible for computing the earn for whatever spend you have done. It is managed by a team, and I am one of the engineering managers for the application which does that earn computation.
[00:02:59] Unknown:
Do you remember how you first got involved in the area of data management?
[00:03:02] Unknown:
Currently, with Capital One, data is pretty much in Capital One's DNA, given the volume that we currently manage. Compared to that, from my old days when I started, in the initial phase of my career, I was always associated with data. In my previous role it was not at the same scale as what I'm dealing with now. It was for AT&T's back-end router provisioning application, and it was pretty much internal to them. What I was primarily dealing with was some sort of ITIL process which exposes all the network provisioning orders which have gone through.
After a certain time, it was more about router configuration, and it was all dealing with data: say, Juniper or Cisco router configurations, and how it is all depicted in a logical system compared to the physical, real network. That was what I was previously doing: the application which has to process that data and figure out the discrepancies between these physical and logical systems, so that when someone is trying to provision, it validates whether both are in sync. So I have always been associated with data. But I really started looking into different ways of doing those processes when I got into big data computing, and at the same time things were also progressing on big data and distributed computing.
Apache Spark was coming along. Prior to that, we had Hadoop. So all these things were converging at one point, and that's when I really jumped into this big data ocean. From then on, I started doing large things for Capital One. It's all mostly data intensive, given the volume of data we manage. It may not be at terabyte scale; yes, we do manage those things, but not on a daily basis. So
[00:05:09] Unknown:
pretty much I have done a lot of things from my initial days on data, which I continue to do even now at Capital One as well. That brings us to the work that you're doing today. And I'm wondering if you can just start by giving a bit of an overview of some of the types of data that you're dealing with and some of the workflows that you're responsible for in the work that you're doing at Capital One.
[00:05:29] Unknown:
So on a very high level, as I was saying, we are the system which is responsible for customer credit card transaction processing, which pretty much utilizes Apache Spark for most of its processing, and there are some orchestration pieces. If you take a very high-level, 10,000-foot view, it's more like we receive customer credit card transactions, there is a big computation which happens, and then we process those things. That's a very high-level workflow. On the technology side, what we deal with is mostly Apache Spark, and there are a lot of back-end databases, those kinds of things. That's what our workflow is, at a very high level. The primary business case is computing earn and surfacing that information for customers in different ways.
And when it comes to the type of data we deal with, it's pretty much the specific data format in which we receive transactions; that's somewhat specific. But beyond that, we deal with a lot of JSON and we process a lot of information. The enriching or filtering data also gets more into the Parquet side of it. Those are the high-level data types we deal with, and the workflow, primarily, is we get the transactions, compute, persist. Those are the very high-level blocks, at least in my application currently. In terms of the
[00:07:00] Unknown:
kind of volume of data, you mentioned that you're dealing with a lot of Parquet files, so I'm assuming that you're primarily sourcing and depositing information in S3. I'm wondering what you are looking at in terms of the three Vs of volume, variety, and velocity, as to the magnitude of each of those dimensions and some of the ways that that magnitude might complicate the work that you're doing or add some additional detail-oriented necessity to how you think about approaching those workflows?
[00:07:28] Unknown:
So if we get into the three Vs, volume-wise it ranges from probably 25 to 50 million transactions daily; that's what we typically process. And then that volume goes through a variety of workflows. It ranges from a simple use case, if you're familiar with Capital One cards, where a customer swipes using one of our cash-back cards, Quicksilver, and earns 1.5% back on whatever they spend, all the way to a customer getting a specific bonus depending on their portfolio. The variety ranges across various cases, and there are anniversary bonuses, those kinds of business cases. And when it comes to velocity, we do have a mix of things, but for this use case it happens per day. That's the way we put things together, and that volume ranges from 25 to 50 million.
[00:08:41] Unknown:
And sometimes it even exceeds that, but for our conversation we can say 25 to 50 million is the processing volume we actually deal with. And because of the fact that you're doing all of this data processing within the boundaries of a banking organization, which is subject to a lot of financial regulations and scrutiny, I'm wondering what are some of the business requirements and regulatory requirements that factor into the types of operations that you're doing, how you need to think about auditability of those workflows, the level of accuracy that's necessary, and some of the ways that you need to verify and validate the accuracy of the computations that you're performing?
[00:09:21] Unknown:
To answer that, we operate in a highly regulated environment, and we need to account for a lot of auditing documents. We, being a source of truth, have an obligation to produce those documents. A simple way to put it is: are we always doing what we said we will do? It's a simple one-liner, but all of our decisions have to be viewed through this lens. And the example which we are about to discuss is also a primary reason for implementing that; the enriching pattern was an outcome of this kind of need which arose, because in some cases we have to answer what happened. A simple case is a customer can call and say, hey, why didn't I get this earn for this particular spend? And we are answerable to them. Beyond that, we are answerable to regulators.
In a lot of cases, at this scale of volume, we need to see what has happened to a particular transaction. That is one of the simple things we have. So a lot of the designs we actually tackle are also mainly related to those kinds of things. And as I said, we process 25 to 50 million transactions, and we need to see what has happened to each one transaction in case some question arises in various forms, like why that particular transaction hasn't earned. So we are answerable to all of these things, and most of our design always accounts for how well we can answer all of these kinds of needs.
Enriching is one of those things, and there are various related things we tackle in our whole pipeline. Those are the things which indirectly produce the documents which I was mentioning.
[00:11:29] Unknown:
- So the data assets that you're producing, I'm wondering who are the downstream consumers, whether it's business end users, other application teams, or data scientists, and how do those various consumers influence the way that you think about the format that you need to generate your outputs in?
[00:11:48] Unknown:
It's a combination of both. We, being the source of truth, are actually responsible for producing the data, which is mainly needed across the enterprise. So I'll group them all as our internal customers. The information we produce is also used by a lot of our DAs. They all fall into the internal category, and there are various ones depending on their use cases, all those kinds of things. Because we produce this rich data, there is a need for it within the enterprise; they are all internal clients.
And coming to the external side, the simple one is the customer getting their statements, which is also somewhat internal. But in a similar way, let's take the Costco coupon, which most people here can probably associate with, which goes out every year to the customer depending on whatever they earned in the previous year. If you see that use case, customers are actually getting rewards in that coupon. We, being the system of truth, do have similar kinds of things in our processing, and we do need to surface that information to our external partners as well. That actually gets into their systems, and that's how this flow also works. So we do have a mix of both.
We have internal customers and external customers. Internally, there are various use cases, because this data goes into all kinds of warehousing needs and some specific transactional needs; we do have all of these use cases internally as well. Externally, I was just trying to give the Costco coupon example; we do have similar sorts of things for external clients as well. So the data we produce is actually utilized by both.
[00:13:49] Unknown:
In terms of the technical elements, I'm wondering if you can just talk through some of the system architecture and system design, some of the constraints that you're dealing with as far as how to think about designing the jobs that you're working with, and the ways that you're leaning on the tools that are available to you?
[00:14:09] Unknown:
So our data pipeline, as I was previously mentioning, uses Apache Spark as its processing framework. And we do have a mix of server and serverless options in our pipeline; depending on the use cases, both are in play. And we use Apache Spark in various ways. At least in the application which I'm responsible for, we do have both use cases, where batch and real time are being used. But across the enterprise, we do have batch, real-time, and ML-related use cases. When it comes to the constraints, I think over time there have been things which we have had to specifically design for our needs.
[00:14:58] Unknown:
You mentioned that you're leaning heavily on Spark. And I know that you wrote a post a little while ago and presented at the Databricks conference to talk about some of the patterns that you started with and evolved to handle some of the specific requirements of your processing workflow and the regulatory capabilities that are necessary to be able to support some of that auditing. But before we get into that, I'm wondering if you can just talk to some of the other ways that Spark is being used throughout Capital One and how you're able to maybe lean on some of the established patterns in the existing infrastructure and the aggregate data processing knowledge of the organization?
[00:15:38] Unknown:
Across the enterprise, we do have various use cases. We use Spark, Spark being the industry leader in distributed computing, and it is used heavily across our enterprise. For a simple batch workload, we consume some file; there are various ETL-related jobs; and there is even transactional processing like ours, heavily transaction-based processing, which also leverages batch Spark. There is some real-time and hybrid processing as well that I have seen, which is more like a mix of both batch and real-time workloads, solving a real business use case where a customer tips and then they immediately get a notification. That is actually a hybrid implementation across the enterprise. There are even some specific ML-related things, which I am not very close to, but I have seen that those are also being used. So we have a variety of use cases within Capital One; we use Spark heavily.
[00:16:43] Unknown:
In terms of your specific workflow, you mentioned that you're dealing with processing some of the purchase information to calculate the rewards that are going to be available to a customer. As opposed to a fraud situation, I'm sure that there are different requirements in terms of the latency that's acceptable for end-to-end processing. And I'm wondering if there are prioritizations or different Spark clusters that are available to make sure that the fraud analysis doesn't get held up behind the rewards computation, or any sort of priority scheduling across the different Spark infrastructure that's available, and how that manifests in terms of what the processing capacity is at any given point? At least in our world, these two don't really collide, in the sense that
[00:17:30] Unknown:
the case I was explaining is a hybrid case; even if fraud is also involved in that, it doesn't really get into this use case itself. And we, being transactional processing, are not operating in real time for what we actually do. There is a mix of those things, but for our use case we don't really collide on sharing of resources, those kinds of things, because it's not something like a Kubernetes cluster where everything is heavily shared. Instead, it's more driven by each use case and how they manage their infrastructure: they bring up their compute and then they're done, those kinds of things, and that is how it doesn't collide, particularly in this use case. Whether they're testing on fraud, or computing using customer information, or immediately responding to some of the information from the authorization, those kinds of things all don't collide, basically.
[00:18:23] Unknown:
And so digging now into the specifics of your workflow and some of the learning that you did in the process of going from the initial implementation to where you are now, I'm wondering if you can just talk to some of the specific requirements that were in place as far as the information that you needed to be able to compute, the information that was necessary to maintain throughout those various stages of computation, and the initial design that you developed, and what were the priorities that you were focusing on when you landed on that initial structure? Maybe I'll
[00:19:00] Unknown:
just dig in more and give a high-level description of what the use case is; then that will help us trace through this. So the high-level use case, as I said, is we process the customer transactions, and there are multiple eligibility things we need to ensure before they really get into the earn computation. Something like account eligibility: a simple one is whether this account is really eligible to get rewards. And there are various transaction eligibilities; say there is a balance transfer, which is also one of the things a customer does, but typically for those things the customer may not be earning rewards, so just to connect things, that also falls into the transaction eligibility case. Once we do all of these eligibility checks and see that, yes, the transaction is eligible,
they are eligible to get the earn. We will compute the earn on those and then persist to our data stores for various downstream use cases: a customer logging into their account through the web page or through mobile to see what transactions they had or how much in rewards they have, all those kinds of things, or a customer getting their statements, which carry the rewards information. This is a very high-level platform depiction. Our initial design was actually using a pattern called the filtering pattern in Apache Spark. That filtering pattern uses inner joins; in simple terms, two different datasets are joined for a particular need. In this example, let's say there are credit card transactions, and we are trying to see, out of those, whether any are ineligible based on account or transaction.
That's one of these joins. And if you use the filtering logic, which was our initial approach, to give you an example of how that filtering was done at scale, but diluted for a simple explanation: say we had 10 transactions when we started, but when we did the account eligibility check, only 5 transactions remained. Based on account eligibility, we then need to determine the transaction eligibility. When we try to determine the transaction eligibility of those 5, we got only 3, because we are filtering at each stage. So, finally, let's say we end up with 2 transactions which are really eligible to compute earn on, and we end up computing earn for those 2 transactions. This is what I mean by filtering, where at each stage of the computation we filter out transactions.
Which means, if you take the dataset as rows and columns, it actually shrinks in rows. When we started with 10 rows, we finally ended up with 2 rows of eligible transactions. That was our initial implementation, and we went live with it. And we immediately realized its major shortcoming: in retrospect, if we have to back-trace any one transaction, why it got filtered or at which stage it got filtered, people who are very familiar with Spark can say, hey, you can do a count at each stage, which is actually a costly operation when it comes to Spark. We can still do that, but it will only give us how many got filtered out at each stage.
Instead, if you want to get to the granularity needed for a regulatory or traceability need, you need to know not just how many got filtered out; we also need to know why they got filtered out. With the scale of our operation, say 50 million transactions processed, we need to be able to figure out, hey, what has happened to that particular transaction? That granularity is what was lacking in the filtering approach. We immediately found that out as a shortcoming, and then we switched over to a pattern which is best described as the enriching pattern. In the enriching pattern, the information actually grows column-wise. In the sense that, in the same example, we start with 10 transactions, and at each stage, instead of filtering out transactions, we keep enriching the dataset so that there is enough information to make a decision. In the same example, let's take 10 transactions.
When we try to see how many are really eligible accounts out of these, instead of filtering them out, we bring in the information and enrich the dataset. Same thing for transaction eligibility. Finally, when we try to see which are not really eligible using the information we gathered at each stage, in terms of rows it stays at 10, but in terms of columns it grows, because we have been enriching at each stage. So, finally, using all the enriched information, we make a determination: okay, we see that these are not eligible accounts, and these are not eligible transactions.
In the end result, we will end up with the same 2 transactions, but how we reach that state is what makes a difference, based on the use case. There may be some valid use cases where people want to compute pretty quickly in memory; in some cases, filtering helps. In cases like ours, where we need back-traceability and granular details, the enriching option fits better, because at any point in time you can go back to your Parquet information and see, hey, what has happened to that one particular transaction? Oh, this was the reason: at that point in time, this account was actually ineligible to get rewards. That is the reason they haven't got rewards. That is, at a very high level, how these two compare.
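To make the contrast concrete, here is a minimal PySpark sketch of the two patterns as described above. The paths, column names, and eligibility tables (account_id, is_account_eligible, and so on) are illustrative assumptions rather than Capital One's actual schema or code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("filter-vs-enrich").getOrCreate()

transactions      = spark.read.parquet("s3://bucket/transactions/")        # hypothetical paths
eligible_accounts = spark.read.parquet("s3://bucket/eligible_accounts/")
eligible_types    = spark.read.parquet("s3://bucket/eligible_txn_types/")

# Filtering pattern: inner joins shrink the row count at every stage, so the
# transactions that were dropped (and the reason they were dropped) are lost.
filtered = (transactions
            .join(eligible_accounts, "account_id", "inner")
            .join(eligible_types, "txn_type", "inner"))

# Enriching pattern: left joins keep every row and add eligibility flags as
# new columns, so each transaction carries its own audit trail.
acct_flags = eligible_accounts.select("account_id",
                                      F.lit(True).alias("is_account_eligible"))
type_flags = eligible_types.select("txn_type",
                                   F.lit(True).alias("is_txn_type_eligible"))

enriched = (transactions
            .join(acct_flags, "account_id", "left")
            .join(type_flags, "txn_type", "left")
            .fillna(False, ["is_account_eligible", "is_txn_type_eligible"]))

# Earn is only computed where every flag is true; ineligible rows remain in
# the dataset and stay queryable for back-tracing.
earn_candidates = enriched.filter("is_account_eligible AND is_txn_type_eligible")

# Persist the full enriched dataset (not just the eligible rows) for auditing.
enriched.write.mode("overwrite").parquet("s3://bucket/enriched_transactions/")
```

In this sketch the row count of the enriched DataFrame matches the input while its column count grows with each eligibility check, which is what makes the "why was this transaction excluded" question answerable later.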
[00:25:28] Unknown:
Given the fact that you need to persist this information at various stages, I'm wondering if you have looked at leaning on anything such as the Delta Lake format for being able to materialize this information as a set of tables, or any other sort of data warehouse style, maybe with history tables, to be able to maintain the different states of computation at those various stages. Or, if the auditability is only there for regulatory purposes, you don't necessarily need to keep it live and queryable in an easy-to-access format; you just need to be able to maintain that history and perhaps be able to compact it over time.
[00:26:08] Unknown:
For this particular use case, we haven't really persisted at each stage. What we have been doing is enriching, so the dataset grows column-wise. When it comes to performance, we haven't really seen any major degradation, even at the same scale I was explaining, 50 million transactions, because it's a similar dataset and only the delta of the computation changes: the columns grow, but the information stays the same. So persisting at each stage really hasn't been needed for us. But when you bring up Delta Lake and those kinds of things, those are very recent developments in Spark, so we are definitely evaluating them. In our case, though, this is actually transactional processing.
We do push this information out for other use cases; this information gets out, and behind the scenes there are some use cases which utilize it and take some reactive actions. That is also there. But particularly for this transactional use case, we haven't persisted at each stage, and we haven't seen any major performance degradation. This is one of the processes in the transactional pipeline, so we have just kept it as it is, because we haven't really seen major performance degradation.
[00:27:37] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt, and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. And as far as the performance and storage aspects, as you mentioned, with the filtering approach everything's done in memory, so it's all ephemeral and you don't have to worry about persisting this information. But because of the fact that you do need to have that audit trail, you want to see what the data was at every point, what decisions were made, and why, you need to be able to persist that information across these various computation stages. And I'm wondering, what are some of the impacts on the overall data volume produced by these jobs going from the filtering approach to the enrichment approach, some of the ways that latency is impacted as well, and maybe some of the data quality checks that you've put in to ensure that you don't end up with any sort of regressions as you evolve the job?
[00:29:19] Unknown:
There is definitely a delta increment to the whole computation, because, as you said, in the previous approach everything was computed in memory, so comparatively that is quick. With the enriching option, we are growing the dataset at the column level. But what we actually do is capture a lot of this information, and that information is used to make the determination at one stage, which is itself one of the stages. So there is only a slight delta increment in pure computation time. For this particular use case, we looked at what is actually a priority for us: can we live with that delta compared to the advantages it provides for our use case? So we are just living with that additional delta. But as I was saying, even at 50 million transactions, the delta is not too much, and the benefit it provides always outweighs that extra processing. So that's what we are using, and we are not seeing major latency, those kinds of things.
[00:30:29] Unknown:
As you have been working on this job and you went from the filtering to the enrichment approach, I'm wondering what are some of the other optimizations or improvements to this workflow that you're planning to build in, any sort of validation, data quality checking, or additional quality control that you're adding to ensure that you have the auditability, and maybe some of the tooling that you're building to ensure that similar jobs or similar requirements in the future are able to benefit from what you learned in the process of going through this one workflow?
[00:31:04] Unknown:
After we realized the benefits of this pattern of enriching the data, we adopted it as our standard pattern for any of our data pipelines going forward. The data pipeline we are serving is a fairly involved process: there are a lot of sub-pipelines, there is a lot of information and determination done at various stages, and these sub-pipelines contribute to the main pipeline. The initial assessment we did of the implementation was on one of the main pipelines, and for whatever sub-pipelines we have implemented since, we have adopted this pattern across the board. That is also one of the things which we use as a standard.
And coming to optimizations or improvements, there are various learnings we have had when it comes to this whole pipeline. I have actually recently published one of those: the learnings we had from having NoSQL back ends for Spark apps. There are a few things which we had to particularly tweak when it comes to having a NoSQL database compared to our old traditional data stores, which had transactions. I won't say that is completely gone; the industry is moving more towards NoSQL, and something like NewSQL. For all of those kinds of things we have made some tweaks, which I have already shared, a topic for another day. But there is some optimization we have actually done in the whole pipeline beyond adopting this as our standard pattern.
[00:32:51] Unknown:
In terms of the work that you're doing with Spark, what are some of the cases where you've run up against some of its limitations as far as what you're able to do with it, and some of the workarounds that you've built or some of the times that you've gone to other tools because Spark just wasn't the right fit for the requirements that you had? One thing which comes to mind immediately is that the data types, in certain cases, in our dataset
[00:33:17] Unknown:
have actually caused some problems for us when we are trying to join some information and do some computation. Certain data types have caused problems, like timestamps, and we do manage a lot of these monetary-related things. Those kinds of things are where I have seen some specific issues. I won't say it's a limitation; it's just not a trivial way to do those things. That is what, I may say, was a limitation we had when we were building this whole pipeline, but which we have actually overcome by adopting UDFs.
Spark comes with user-defined functions, which give you the flexibility to handle whatever specific way your data needs to be processed. You can package that as a particular function, and that function does things however you want. So we have adopted UDFs for those kinds of use cases.
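As a rough illustration of that approach, here is a hedged PySpark sketch of the kind of UDFs being described: one normalizing a source-specific timestamp string before a join, and one coercing monetary amounts to exact decimals. The format string, column names, and path are assumptions for illustration, not the actual pipeline's schema.

```python
from decimal import Decimal, ROUND_HALF_UP
from datetime import datetime

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import TimestampType, DecimalType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

@F.udf(returnType=TimestampType())
def parse_txn_timestamp(raw: str):
    # Tolerate a non-standard timestamp format from the source system
    # (assumed format; adjust to the real feed).
    return datetime.strptime(raw, "%Y%m%d%H%M%S") if raw else None

@F.udf(returnType=DecimalType(12, 2))
def to_money(raw: str):
    # Keep monetary amounts as exact decimals rather than floats.
    return Decimal(raw).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP) if raw else None

df = spark.read.json("s3://bucket/raw_transactions/")   # hypothetical path
df = (df.withColumn("txn_ts", parse_txn_timestamp("txn_ts_raw"))
        .withColumn("amount", to_money("amount_raw")))
```

Once the columns are normalized this way, joins and aggregations on them behave predictably, which is the gap the UDFs are closing here.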
[00:34:16] Unknown:
As far as the other ways that you've seen Spark used, or other ways that you've used Spark while at Capital One or maybe in other jobs as well, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:34:28] Unknown:
We do have various use cases, and we are definitely a Spark-heavy shop, spanning batch, real time, and machine learning, with a lot of tools. One which is closest to me, and which I feel is much more interesting, let's say innovative, is the combination of both batch and real time to solve a particular business case and give the customer a real benefit. It's the one I mentioned before: alerting a customer when they have tipped high, or if we see the same transaction being authorized with similar attributes, we immediately alert the customer saying, hey, is it a double swipe?
Are you really doing a second swipe? Those kinds of cases are actually provided as functionality to Capital One customers, and in the back end that does use this hybrid model, where some information comes from batch computation and there is definitely a real-time need, which is also done using Spark Streaming, so the customer gets alerted using the information from both kinds of processing. So that is one which is close to me, and I felt like that is an interesting solution for a customer need.
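For readers unfamiliar with that hybrid shape, here is a simplified, hedged Structured Streaming sketch of what such a double-swipe check could look like. The Kafka topic, schema, window sizes, and the idea of joining batch-computed alert preferences into the stream are illustrative assumptions, not the production design.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("double-swipe-alerts").getOrCreate()

# Live authorization events (hypothetical topic/schema; requires the
# spark-sql-kafka package on the classpath).
auths = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "card-authorizations")
         .load()
         .select(F.from_json(F.col("value").cast("string"),
                             "card_id STRING, merchant_id STRING, "
                             "amount DECIMAL(12,2), event_time TIMESTAMP").alias("a"))
         .select("a.*"))

# Batch-computed reference data joined into the stream is the "hybrid" part:
# the output of a batch job feeding a streaming job (hypothetical path).
prefs = spark.read.parquet("s3://bucket/alert_preferences/")
auths = auths.join(prefs, "card_id", "left")

# Authorizations with the same card, merchant, and amount inside a short
# window suggest a possible double swipe.
possible_dupes = (auths
                  .withWatermark("event_time", "10 minutes")
                  .groupBy(F.window("event_time", "5 minutes"),
                           "card_id", "merchant_id", "amount")
                  .count()
                  .filter(F.col("count") > 1))

# In production this would feed an alerting sink; console is a stand-in.
query = (possible_dupes.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```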
[00:35:57] Unknown:
And in your own work of building on top of Spark and doing data engineering at Capital One, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:36:07] Unknown:
Challenging, I would say, is that when it comes to this whole distributed processing, running a transactional system which leverages Spark as its framework with a NoSQL back end requires a lot of unique customizations and designs to operate in a regulated environment where we are dealing with customers' real money. Managing those things holistically is somewhat challenging on the operational side, because Spark is a distributed computation framework, and NoSQL, on the other side, compared to our traditional data stores, is also known for scale, because the information is also distributed.
In some cases these two go very well together; when it comes to Spark and Cassandra, they're known for processing things at the partition level very well. But when it comes to Spark and Mongo, how you leverage the information and the libraries they provide to reap the maximum benefit for a transactional store is somewhat challenging, and it requires unique design patterns to have it be a resilient system.
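As context for that point, here is a hedged sketch of a Spark-to-Cassandra write using the DataStax Spark Cassandra Connector, where the partition-level cooperation he mentions comes from the connector grouping writes by partition key. The keyspace, table, host, and path are placeholders, and the connector itself has to be on the classpath (for example via --packages).

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("earn-writer")
         .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder host
         .getOrCreate())

# Hypothetical enriched output from an earlier pipeline stage.
earn = spark.read.parquet("s3://bucket/enriched_transactions/")

# The connector batches rows by Cassandra partition key, which is where Spark
# and Cassandra cooperate well; idempotent, transaction-like semantics on top
# of this are the kind of custom design work described above.
(earn.write
     .format("org.apache.spark.sql.cassandra")
     .options(table="transaction_earn", keyspace="rewards")
     .mode("append")
     .save())
```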
[00:37:25] Unknown:
As far as some of the upcoming projects that you're working on at Capital One, what are some of the types of challenges that you're excited to dig into, or some of the additional strategies that you are planning to try out, or lessons that you're looking to learn as you evolve the processing infrastructure for the specific data domain that you're responsible for?
[00:37:48] Unknown:
I have been using Spark from its very early days, probably well before it reached version 1. And I have been following the Kubernetes space for some time as well. So I do see that two of these massive distributed processing technologies, one a framework and one a platform, are converging, and with Spark's recent versions, I think 3.x, they even started natively supporting this. So the convergence of Kubernetes and Spark is one area where I am definitely interested, because moving from VM-based processing into container-based processing, using Kubernetes to manage all of those things, is something I'm really excited about. I want to see all of our workloads get into this pattern, because that's where the industry is moving.
So that convergence actually has a lot of benefits, and we can solve the right kinds of use cases with it. That's one thing I'm really curious about: the convergence of these two.
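As a hedged sketch of what that convergence looks like in practice, the snippet below points a Spark session at a Kubernetes API server so that executors run as pods instead of on VMs. The API server address, container image, namespace, and executor count are placeholders, and a real deployment (typically cluster mode via spark-submit) involves more configuration such as service accounts and driver networking.

```python
from pyspark.sql import SparkSession

# Client-mode example of Spark natively scheduling its executors on Kubernetes.
spark = (SparkSession.builder
         .appName("earn-on-k8s")
         .master("k8s://https://kubernetes.example.com:6443")                   # placeholder API server
         .config("spark.kubernetes.container.image", "example/spark-py:3.3.0")  # placeholder image
         .config("spark.kubernetes.namespace", "data-pipelines")                # placeholder namespace
         .config("spark.executor.instances", "4")
         .getOrCreate())

# Trivial job just to exercise the executor pods.
print(spark.range(1_000_000).count())
```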
[00:38:55] Unknown:
Are there any other aspects of the work that you're doing at Capital One and the ways that you're applying Spark to data management for these financial transactions that we didn't discuss yet that you'd like to cover before we close out the show? Well, I think the main idea is that we followed a particular pattern and we switched over to a different pattern for our use case. My goal was to socialize this so that people can
[00:39:19] Unknown:
make a somewhat informed decision for their own use case. For some use cases it may be filtering; for some other use cases it may be enrichment. Along with that, as I was saying, there are a lot of learnings we have had in this whole process, and I have shared them in different forms. Those things will probably benefit the audience. I feel like those are the things I actually wanted to cover.
[00:39:43] Unknown:
Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:39:57] Unknown:
People can reach out to me on Twitter; my handle is gocool_p. Or if they prefer LinkedIn, it's gokulp, one word. So they can reach out to me on Twitter or LinkedIn. Answering the final question, one thing which I feel could be enhanced throughout the process is developer tooling. Because Spark is a distributed computation framework, when we are dealing with a lot of this distributed processing, developer tooling where they can do a lot of these kinds of things on their own systems would help. I know Docker does some of this, and we can get all of these things onto Docker as well.
A somewhat easier way of handling these things on their own systems would probably help developer productivity. That's one thing we can probably improve upon at this stage.
[00:40:55] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Capital One and some of the lessons that you've learned as far as the different processing patterns and the benefits and trade-offs that they provide. I appreciate the time and insight that you've been able to share, and I hope you enjoy the rest of your day. Thank you, Tobias. It was a pleasure, and thank you for the opportunity. You also have a wonderful day.
[00:41:23] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Introduction and Sponsor Messages
Interview with Gokul Prabagaren
Gokul's Background and Role at Capital One
Overview of Data Workflows at Capital One
Regulatory and Business Requirements
System Architecture and Design
Initial Implementation and Challenges
Enrichment Pattern and Its Benefits
Optimizations and Future Improvements
Limitations and Workarounds with Spark
Interesting Use Cases and Lessons Learned
Future Projects and Industry Trends
Closing Remarks and Contact Information