Summary
Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode, Andrey Korchak, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- Your host is Tobias Macey and today I'm interviewing Andrey Korchak about how to manage data in a fintech environment
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by summarizing the data challenges that are particular to the fintech ecosystem?
- What are the primary sources and types of data that fintech organizations are working with?
- What are the business-level capabilities that are dependent on this data?
- How do the regulatory and business requirements influence the technology landscape in fintech organizations?
- What does a typical build vs. buy decision process look like?
- Fraud prediction in e.g. banks is one of the most well-established applications of machine learning in industry. What are some of the other ways that ML plays a part in fintech?
- How does that influence the architectural design/capabilities for data platforms in those organizations?
- Data governance is a notoriously challenging problem. What are some of the strategies that fintech companies are able to apply to this problem given their regulatory burdens?
- What are the most interesting, innovative, or unexpected approaches to data management that you have seen in the fintech sector?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data in fintech?
- What do you have planned for the future of your data capabilities at Monite?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products.
Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey, and today I'm interviewing Andrey Korchak about how to manage data in a Fintech environment. So, Andrey, can you start by introducing yourself?
[00:01:32] Unknown:
Sure. Thanks for having me here. So my name is Andrey. I'm the chief technology officer at the Fintech company Monite. We're working in the Fintech API space. Our company is kind of like AWS for those who want to create Fintech applications, but instead of supplying our customers with low-level components like databases or cloud services, we provide them building blocks like invoicing or accounts or a payments engine. So instead of writing all the business logic, you can plug all our APIs in, and you're good to go. That's me, and that's our product.
So, obviously, as a technical director, I'm responsible for everything we're doing in our company, and taking care of data is part of my responsibilities. I'm not actively writing data processing pipelines anymore, and I'm not managing data stores and data warehouses directly, but I know quite well how everything works, and I'm responsible for all the decisions we take.
[00:02:40] Unknown:
And do you remember how you first got started working in data?
[00:02:44] Unknown:
Sure. I have a solid background in software engineering, almost 24 years since I started writing code. My previous company, my previous startup, was an EdTech company, and we were doing natural language processing. Obviously, machine learning was involved, and I was doing data management, data processing, all these kinds of things. Now I'm working in the Fintech space, and we're not machine learning savvy, but we still have to process customer data, we still have to train some algorithms, and, obviously, I'm involved in all these things.
[00:03:24] Unknown:
And now getting into the Fintech specifics and some of the challenges around data in that space, I'm wondering if you can just start by giving a bit of a summary of the ways that data is used in Fintech and some of the particular challenges in that space pertaining to data.
[00:03:44] Unknown:
Sure. So our company is working with small and medium businesses, and the space is a bit different from, say, private banking or investment banking, so my examples will be relevant only for SMEs. We have a few challenges. Challenge number 1: we have to analyze and interpret quite a lot of financial documents. I'm talking about invoices, cancellation notes, purchase orders, receipts, contracts. So, basically, in our case, we have to do OCR, we have to do pattern recognition, and other things related to document processing. Second thing, we have to do document categorization.
So we have to be able to distinguish invoices from cancellation notes and purchase orders, and, surprise, those documents look very similar to each other, and telling these documents apart is indeed a very hard task. And third thing, we have to deal with compliance. In order to do this, we're doing fraud prevention, we do transaction monitoring, and other things that might be relevant to that topic. So these are the main data-related things we do in our company. Now, a few words about the data sources we have. Obviously, we have financial operations.
I'm talking about bank transactions or card payments. This type of data is, well, not extremely, but well structured. Yes, there are tiny differences like different data formats or different currency codes, but when you have a list of transactions made from a bank account, you can make pretty good guesses about where to find transaction descriptions or card numbers, and that's not very challenging. Then, documents. Financial documents come from customers, and there are all kinds of documents: PDF files, photos, scans, bank statements, receipts, and dealing with these docs is indeed very hard.
Now, we have some data that is not directly relevant to financial transactions, but this data is very helpful when you're doing finance management. We're talking about CRM records, information from marketing campaigns about marketing budgets, or customer support requests regarding some financial operations. And the last source of information we're dealing with is internal data. We're talking about logs, conversations with our customers, streams of events generated by the users, and other internal things. This data is not visible to our customers, but it plays a critical role when we operate our software and make business decisions.
[00:06:55] Unknown:
In terms of the application of data in the business context for fintechs, I'm wondering what are some of the key capabilities that are powered by data in a fintech context and some of the complexities that are involved in being able to bring data to bear for those problems?
[00:07:14] Unknown:
Sure. So, transactions and all the financial operations. As I mentioned before, this data is well structured, so it's kind of easy to deal with financial records. There is no need for sophisticated algorithms or machine learning because we already have all the numbers in place, so we can just do the math on these records. So it's not complicated, but it's crucial to provide to our customers the basic functionality related to financial transactions. Document management is something that's kind of hard to deal with because, as I mentioned before, there are tons of different documents. And in the modern world, people still use invoices or contracts printed on paper, and it's extremely hard to deal with these records. Before you'll be able to do something with these documents, you have to scan them, you have to verify them, and you have to categorize them. This is called preparatory accounting, and it's an extremely challenging area.
Obviously, it's pretty exhausting to deal with all these documents because sometimes they come in from different channels, from different countries. They may look different, and it's challenging to categorize these documents and assign them to the right department. So that's what we're doing. We're processing all the flows of information coming in on paper, via fax, via email, and we categorize and recognize these documents and send them in the right direction. And analytics is something that comes on top of transaction information and on top of documents.
So we have to be able to calculate the cash flow, we have to be able to identify gaps in the cash flow, we have to do some forecasting, and other things. So it's basically 3 major areas we're operating in: document management, transaction management, and analytics.
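To make the analytics piece concrete, here is a minimal sketch of the kind of cash-flow gap detection described above, written in pandas. The transaction schema, the monthly grain, and the safety-buffer threshold are illustrative assumptions for this sketch, not Monite's actual model.

```python
# Hypothetical sketch of cash-flow gap detection over a transaction ledger.
# Column names and the threshold are illustrative assumptions.
import pandas as pd

transactions = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-25"]),
    "amount": [12_000.0, -9_500.0, 4_000.0, -8_000.0],  # positive = inflow, negative = outflow
})

# Aggregate to monthly net cash flow and a running balance.
monthly = (
    transactions
    .set_index("date")
    .resample("MS")["amount"]
    .sum()
    .rename("net_flow")
    .to_frame()
)
monthly["running_balance"] = monthly["net_flow"].cumsum()

# Flag months where the running balance dips below a safety buffer.
SAFETY_BUFFER = 1_000.0
monthly["cash_gap"] = monthly["running_balance"] < SAFETY_BUFFER

print(monthly)  # February ends up flagged: 2,500 inflow, then a -4,000 month
```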
[00:09:25] Unknown:
I know that in the Fintech sector, there are a lot of regulatory requirements. The risks of getting things wrong in the business are quite high because you can start to lose a lot of money very quickly. And, also, any errors in the data or in the application of data can lead to a loss of trust among the customers, which can also lead to a pretty substantial financial hit. And I'm wondering what are some of the ways that those regulatory and trust issues factor into the ways that organizations think about the collection and application of data?
[00:10:01] Unknown:
So as a fintech company, we have to obtain an ISO 27001 certificate. The certificate defines the way we process critical financial information. Well, not only financial: everything that is crucial for our customers, including scans of IDs, receipts, and transactions, all these things. In order to deal with all these things, we have to be ISO certified. By law, when you obtain an ISO certificate, you have to enforce some policies on engineers, on the DevOps team, and on all the managers. So access to the data is restricted, and only a few approved and certified engineers inside of the company can have access to it.
A couple of DevOps engineers and, obviously, analysts and managers can also have access to the data, but all the critical data fields are masked, removed, or anonymized. And we have data protection policies, backup policies, disaster recovery policies. So, basically, we have policies for everything. And we also have to make sure that we not only have these policies, but we're able to identify, let's say, data leakages or critical disasters happening in infrastructure. And we do practice: we run training exercises with engineers, trying to simulate disasters or an infrastructure cluster collapse and see what happens after.
Obviously, we have backups for backups, and it's not because we want this, but because we must have backups even for backups. And all the backups are encrypted and not accessible by third party companies; even AWS administrators won't be able to get access to our data. So, basically, all the best cybersecurity practices are in place. Also, we have multiple data centers, and we have separate cloud infrastructures there, so people from one data center cannot get access to the data residing in another data center. That's partially because we have to be compliant with GDPR and other regional regulations, but it's also yet another security measure, and it makes life quite miserable for hackers if they somehow get access to what we have.
Also, we have to be very cautious when we're choosing third party providers. As a fintech company, we have to do background checks, and we have to go through the documentation, and we have to be very cautious when we choose our service providers. They must be ISO 27001 certified because, otherwise, we won't be able to maintain our certification. They have to be regulated by local laws and regulations. And, also, when we're processing data with third party providers, we have to include some clauses in our terms and conditions, because without written consent we cannot do absolutely anything, and we have to inform our users that their data will be used by a third party machine learning company that will be doing, say, invoice processing.
Basically, working in Fintech forces you to sign and maintain lots of contracts and agreements, even with your own employees, with the DevOps team, with engineers. They cannot have access to literally anything unless they sign a contract, and that's pretty annoying. Dealing with those things definitely makes sense, but it slows the business down, because before we can take any actions, we have to make sure that we're good on the legal side. And before we can start working on a new feature, or before we can sign a contract with a new counterparty or vendor, our legal team has to review it, and it's basically just pretty annoying.
Apart from that, we are not very different from other software companies. We're just extremely heavily regulated, and we have backups for backups because we have to make sure that we won't lose any customer data.
[00:14:28] Unknown:
To that point of all of your service providers requiring ISO 27001 certification and the strict security requirements around the data and the ways that it's processed, how does that influence the typical build versus buy decision when you're figuring out what your architectural and system components are going to be and what the different data flows are that you're going to support?
[00:14:55] Unknown:
Okay, that's an interesting question, because when you're working in Fintech, you cannot build an MVP. Well, yes, you definitely can build software that does something for your customers, but at the same time, you have to be compliant, and you have to follow all these laws, rules, and regulations. That means you have to invest quite a lot of time into hiring and training people, into signing all these contracts, and establishing all these data protection policies. It means that in order to get into the Fintech space, you have to pay quite a lot for the entry ticket. Now, if you're a Fintech company, you probably will try to buy as many components, APIs, and services as possible, because they're usually cheaper than doing all these things in house. If you're a seed stage company, you cannot afford to have a compliance officer or a DevOps engineer available 24/7.
But you can pay 10, 15, 20 thousand for that, and that sounds like a reasonable price compared to the alternative, which is to hire everyone, train the people, find managers, write product specs, and establish all these data regulation policies. So, basically, even a simple invoicing solution, if you want to build it in house, might cost you a few million. An MVP of an invoicing engine doesn't seem like extremely sophisticated software, but it could be, because of these laws, rules, and regulations. So if you're an early stage company and you want to enter the Fintech space, you definitely have to consider service providers that already work in this space, because you don't have any other alternatives. Now, if you're a big company, at this point of time you already can afford to build something in house, and there are usually a couple of main reasons for that. Maybe you already have an experienced team and you have all the expertise that is required in order to build the things you want to build; in this case, yes, you can afford it. Another thing: maybe you already have an established business and you're already pumping significant volumes through your software. In this case, building your own solution will probably be cheaper than buying a third party solution, simply because of the volumes.
Or maybe signing a contract with a third party service provider contradicts your business strategy. Yes, building a new fintech vertical inside of your company will be expensive, but you have to make this choice in order to make your long term strategy successful. If you're not a big business and you don't have strategic goals like that, my suggestion would be to consider third party service providers for your Fintech vertical. Yes, it creates vendor lock-in, but the alternative can be being left out of the Fintech market, because, as I've said before, the entry ticket is extremely expensive.
[00:18:18] Unknown:
And going back as well to your point of requiring very rigorous backup capabilities, I'm wondering what are the orders of magnitude in terms of the size of the data that you're dealing with, and some of the ways that you have to think about managing backups, in particular being able to maybe use backup strategies that allow for just using compressed deltas as opposed to having to do multiple full copies of the data?
[00:18:47] Unknown:
So transactional data doesn't take lots of space, because we're just talking about numbers and very short strings of text. There is no problem with transactional data. And, usually, Fintech companies are not processing payments directly; usually, there is a second copy of the information always available on the side of the payment provider. That's the easy part. Now we're talking about the fun part: financial documents. Financial documents have significant sizes. A typical invoice could have a size of 1 megabyte on average. And some companies have, like, multipage long contracts, and the size of those documents can be up to, like, half a gigabyte, and, obviously, we have to take care of that.
And that's a place where we have to rely heavily on the cloud infrastructure, and S3 storage is here to save the day, obviously. But we cannot give direct access to the S3 buckets, so, usually, we have to store encrypted binary data inside of S3, and we have to stream data from those buckets to our customers. Now, talking about backups, we basically created a not very sophisticated but at the same time reliable system of storage of files in S3, basically using the cross-regional replication of AWS, obviously.
Also, we're making secure backups via other AWS services like AWS Glacier, which is a perfect solution for these kinds of things. But we have to make sure that those binary files are encrypted and not available outside of our private network. If our customers want to get access to binary data, we have to stream and decrypt these files on the fly directly to our customers.
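As an illustration of the storage pattern described here, below is a minimal boto3 sketch: documents encrypted at rest in S3 (assuming SSE-KMS, one common way to do this) and streamed back to the customer through the application rather than via direct bucket access. The bucket name and key alias are invented for the example.

```python
# Hypothetical sketch of the pattern described above: documents are written to S3
# with server-side encryption (SSE-KMS here, as an assumption), the bucket is never
# exposed directly, and the application streams the bytes back to the customer.
import boto3

s3 = boto3.client("s3")
BUCKET = "fintech-documents-eu"      # illustrative name
KMS_KEY_ID = "alias/documents-key"   # illustrative key alias

def store_document(doc_id: str, data: bytes) -> None:
    """Upload a document encrypted at rest with a customer-managed KMS key."""
    s3.put_object(
        Bucket=BUCKET,
        Key=f"documents/{doc_id}.pdf",
        Body=data,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId=KMS_KEY_ID,
    )

def stream_document(doc_id: str, chunk_size: int = 64 * 1024):
    """Yield document chunks to the caller. S3 decrypts transparently for
    principals allowed to use the KMS key, so no key material reaches the client."""
    obj = s3.get_object(Bucket=BUCKET, Key=f"documents/{doc_id}.pdf")
    for chunk in obj["Body"].iter_chunks(chunk_size=chunk_size):
        yield chunk
```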
[00:20:51] Unknown:
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake, and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
In terms of the application of data in the Fintech sector, it has been one of the longest-standing users of machine learning, in the context of things like fraud detection, particularly in banking contexts. And I'm wondering what are some of the primary ways that you're seeing ML being applied in the Fintech sector, and some of the ways that those ML requirements, as far as training and serving, influence the architectural design and the capabilities of data platforms for those types of applications?
[00:22:19] Unknown:
Well, obviously, after OpenAI released ChatGPT, everyone started worrying about their future in the business. I had some conversations with our product managers and other C-level managers, and I wasn't really worried at all, because, from my perspective, this rise of large language models and other AI technologies won't change the Fintech industry at all. I have a few reasons for that. First of all, machine learning has already been in Fintech for at least a couple of decades. Fraud prevention, that's one story, but machine learning algorithms are also widely used for credit scoring or for risk analysis.
These algorithms are quite old, and they were here even before deep learning and large neural nets, so banks have been using these algorithms for decades already. Plus, if we're talking about transactional data, as I mentioned before, this data is extremely well structured, so we don't have to run all these sophisticated algorithms on financial data, because we simply can do math. So modern machine learning won't change that much, because pretty much everything was done in this space before. Now, the big thing is optical character recognition. As I said before, we have to deal with documents that are coming in on paper, and we have to deal with stacks of receipts.
And deep learning and other machine learning algorithms definitely revolutionized that space, because the quality of OCR processing increased dramatically. If you remember, like, 10 years ago, there was a thing called Tesseract, and the quality of optical character recognition was pretty poor. The accuracy was pretty low, so you could expect, back in the day, something around 40 to 50%. Now, with AWS Textract, we can have accuracy up to 95 to 97 percent on some documents, so pretty much all the text is getting recognized properly.
And, also, the quality of preprocessing increased dramatically. So if you have a photo of a receipt that was taken in the dark, with flashlights and at the wrong angle, Textract will still be able to recognize and extract data from that photo. So OCR definitely reshaped the preparatory accounting space and the way SMEs are working with financial data and financial documents right now. Another thing that is definitely affected by the rise of modern machine learning technology, and here I'm talking about large language models: if you're a Fintech company, normally you have one customer support engineer per 1,000 customers. And if all these customers decide to write to that person at once, you are basically going to have problems, because the customer support team is usually not designed to handle big volumes. And that's what typically happens when you have major incidents in banking infrastructure: people immediately start texting and calling you, and your customer support center is not able to handle these requests.
So large language models decrease the operational cost of customer support, and the help desk team is one of the biggest drivers of expenses if you run a fintech company. So LLMs are potentially able to completely eliminate, well, not completely eliminate, but significantly reduce the size of the support team. It'll be possible to have, let's say, one agent serving 10,000 or 20,000 people. And it's something that will definitely improve over time, and we have big plans for that.
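For reference, the synchronous Textract call that the OCR discussion above refers to looks roughly like this. The API call is real; the file name is a placeholder, and production pipelines for large multipage PDFs would use Textract's asynchronous job APIs instead.

```python
# Minimal sketch of the AWS Textract flow described above: run OCR on a receipt
# photo and collect the recognized lines with their confidence scores.
import boto3

textract = boto3.client("textract")

with open("receipt-photo.jpg", "rb") as f:  # placeholder file name
    response = textract.detect_document_text(Document={"Bytes": f.read()})

# Keep only LINE blocks; PAGE and WORD blocks are also returned.
lines = [
    (block["Text"], block["Confidence"])
    for block in response["Blocks"]
    if block["BlockType"] == "LINE"
]

for text, confidence in lines:
    print(f"{confidence:5.1f}%  {text}")
```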
[00:26:47] Unknown:
Another major challenge in all data applications and data platforms, but in particular with Fintech because of the regulatory burdens, is that of data governance. And I'm wondering what are some of the ways that you're approaching that problem of data governance and ensuring that you have appropriate visibility, access control, and segmentation of data access, and some of the organizational and policy aspects that you've had to invest in to be able to ensure that you're doing data governance in a way that is compliant and fulfills the regulatory needs?
[00:27:25] Unknown:
Yeah. As I mentioned before, data access is extremely limited, and only a few trained people can get access to it. Also, every team and every vertical in our company has separate database clusters, so they can see only the data of their users, and only a few people in the company can have access to all databases in our company, including me and maybe an engineering manager and a couple of DevOps engineers. Now, with all this, we still have to deal with data analysis, and we have to give access to the data to analysts and managers.
So we have to offload all the information from the databases to the data warehouse, and before we send data to the data warehouse, we have to remove or mask or anonymize critical parts of the information. So, basically, analysts and managers can still analyze data without the risk of accidentally leaking some confidential information on the side. Also, all the configuration files and all the operations on database clusters are recorded in repositories, so we're a GitOps company. In order to do something with production or even staging data, you have to write your database queries in repositories, somebody is going to review them, and if the code is reviewed, then the database queries can be executed.
So, basically, code reviews are mandatory, especially for infrastructure changes, and everything should be recorded. That's maybe the first thing. Second, by law, we must store data for 5 to 10 years, and we have data retention policies defined and maintained on a permanent basis. So people are aware that they cannot delete any financial documents whatsoever. Even if they see a company that hasn't touched its data for a couple of years and think, okay, they have, like, a billion invoices there, maybe we can delete something.
They know they cannot, because maybe tomorrow someone will come and ask for a copy of the invoice that was issued, like, 4 years ago, and, oops, we're going to have problems if this data gets deleted. So those are the 3 major things: access is restricted; everything we do in terms of data management should be recorded and reviewed; and all data records should be stored for years.
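A small sketch of the masking step described above, applied before data is offloaded to the warehouse: direct identifiers are dropped or truncated, and join keys are replaced with salted-hash pseudonyms. The field names and the hashing scheme are assumptions for illustration, not Monite's implementation.

```python
# Illustrative sketch of the masking step described above: before rows leave the
# production database for the warehouse, direct identifiers are masked, removed,
# or pseudonymized. Field names and the salted-hash scheme are assumptions.
import hashlib

SALT = b"rotate-me-and-store-in-a-secret-manager"  # illustrative placeholder

def pseudonymize(value: str) -> str:
    """Stable pseudonym so analysts can still join on the field."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

def mask_for_warehouse(row: dict) -> dict:
    clean = dict(row)
    clean["customer_id"] = pseudonymize(row["customer_id"])  # joinable pseudonym
    clean["iban"] = "****" + row["iban"][-4:]                 # keep last 4 only
    clean.pop("national_id", None)                            # drop entirely
    return clean

row = {"customer_id": "cus_42", "iban": "DE89370400440532013000",
       "national_id": "L01X00T471", "amount_eur": 129.90}
print(mask_for_warehouse(row))
```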
[00:30:14] Unknown:
Another aspect of that governance challenge is also on the side of discovery, because it's hard to do anything with the data if you don't know that you have it. And then once somebody finds a particular dataset that they want to work with, I'm wondering what the process looks like for being able to request and gain access to that data, and some of the audit controls that you have around how to ensure that they are doing what they say they were going to do and that the access is managed appropriately?
[00:30:46] Unknown:
There is bureaucracy, and we have to enforce the bureaucracy on employees because of all the certifications and compliance. So if you need to get access to data, you have to explain to the DevOps team and to me why you need it and what exactly you want to get from the database. You also have to specify the time span you want to have access to, and maybe specify the IDs of the customers or users you want to see in the database. And after that, and after me and the engineering manager and the DevOps team come to the conclusion that, yes, you indeed can have access to that data,
we'll manually execute the database queries, prepare the dump, and hand it over to one of the engineers or analysts. Obviously, all the critical information will be removed from that dump. So, let's say, IDs of the bank accounts or personal IDs will be completely removed. If the data is needed on a permanent basis, for instance if you're working on a new data model, then we have to prepare something for Snowflake. We have to create a data export process that will be sweeping off all the critical information and automatically pouring this data into Snowflake. But before we do this, we have to understand that we'll be dealing with some portions of data on a permanent basis. Otherwise, we just do a manual dump and give it to the engineer, making sure that there is no critical data inside of it.
[00:32:29] Unknown:
In your work in this space and working with customers to enable them to build financial applications, what are some of the most interesting or innovative or unexpected approaches that you've seen to data management in this Fintech sector?
[00:32:43] Unknown:
In the Fintech sector in general, I was surprised that some companies are building quite sophisticated proxies for database servers. These proxies are able to identify on the fly that there are credit card numbers stored in the database, and the proxies will automatically remove all the critical information right from your database query results. So you won't be able to get access to that data whatsoever, and you don't have to write any configuration files. Obviously, with ordinary proxy servers, you're writing a config saying, okay, there is a column called card_number, please remove all the digits from there and just leave the last 4 digits.
But you have to configure these rules manually, and it's extremely annoying. So I was surprised to see that machine learning can identify all these patterns in database columns quite well, and everything will be stripped away on the fly. That's a game changer, because preparing data for analytics or for product managers is an extremely annoying process, but we have to deal with it anyway.
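As a toy illustration of what such a proxy might do (not any specific product's behavior): scan outgoing values for card-number-shaped digit runs, confirm them with the Luhn checksum that payment card numbers satisfy, and redact everything but the last 4 digits on the fly.

```python
# Toy sketch of on-the-fly card-number redaction, as a proxy might apply it:
# find 13-19 digit runs, validate with the Luhn checksum, keep last 4 digits.
import re

CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def luhn_valid(digits: str) -> bool:
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact_card_numbers(text: str) -> str:
    def _mask(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            return "*" * (len(digits) - 4) + digits[-4:]
        return match.group()  # not a valid card number, leave as-is
    return CARD_RE.sub(_mask, text)

print(redact_card_numbers("charged to 4242 4242 4242 4242 yesterday"))
# -> charged to ************4242 yesterday
```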
[00:33:54] Unknown:
And in your own experience of working in this space and building a company focused on Fintech, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:34:05] Unknown:
We definitely had a challenge this year. It's not directly related to Fintech, but it's related to being an API-first business and an API-first company. It was surprising to me that dealing with multiple API versions on a massive scale is pretty tough, because we not only have to maintain multiple versions of APIs, but we also have different versions of data created by different API calls. And managing all these things inside an application was indeed very challenging. So we identified 3 major strategies. First of all, we were thinking about storing and managing data in different Kubernetes clusters with different versions of APIs.
It's obviously a pretty tough approach, because we have to maintain different branches of the code, and whenever you need to patch one cluster, you have to cherry-pick your commits from the repository and somehow patch the other clusters too. But at the same time, you're still going to end up with multiple versions of data created from multiple clusters inside the database, which is a nightmare. There is a second approach that is commonly used in software companies: you just create literal folders for different versions of your code. So let's say you have version 1, and now you have to create version 2. You basically copy-paste version 1 into a folder called version 2 and make changes there.
This code works with the same database. It looks like a very simple approach, but if you're dealing with, like, 10 or 15 different versions of APIs and updating them, that's extremely challenging. And we're an API-first company, and we're dealing with Fintech, so we have some customers using our APIs for years. We simply cannot kill old versions, because there is always a company using them, and we just cannot send them a letter saying, yes, we're going to switch off this version completely starting from January 1st, please switch. Well, we can't, because they pay money to us. So we have to support these multiple versions, and that was the approach we initially took: we created these multiple folders, and we had all these problems.
Then one of the engineers created a framework for API versioning. The philosophy behind this solution is quite simple: we always have the latest version of the data available in the database and the latest version of the business logic. Now, if you have to serve data to someone who is using a previous version of the API, this framework basically downgrades the data representation to the requested version. So let's say we have 5 versions in total, and the latest version is version number 5, so all data in the database and in the data warehouse is stored according to v5 specifications.
But if you have a customer using v4, then we basically downgrade their v5 back to v4. It's like a database migration process, but done backwards. And it turns out that it's an extremely efficient process. The number of lines of code dropped dramatically; we removed, like, hundreds and thousands of lines of code that were so hard to maintain, and we ended up with this solution. So we always store the latest version of data in our data system, designed according to the latest version of the tech specs we have, and we just migrate the data representations backwards to the requested version.
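A minimal sketch of the downgrade-chain idea described here: data lives only in the latest shape, and registered per-version functions rewrite the representation one step down until the client's requested version is reached. The field changes between versions are invented for illustration.

```python
# Hypothetical sketch of the downgrade-chain idea: data is stored in the latest
# (v5) shape, and each registered step rewrites it one version down until the
# client's requested version is reached. The version changes are invented.
from typing import Callable

Downgrade = Callable[[dict], dict]
DOWNGRADES: dict[int, Downgrade] = {}  # maps N -> function converting vN to v(N-1)

def downgrade_step(from_version: int):
    def register(fn: Downgrade) -> Downgrade:
        DOWNGRADES[from_version] = fn
        return fn
    return register

@downgrade_step(5)
def v5_to_v4(doc: dict) -> dict:
    # Pretend v5 split "counterparty" into two fields; v4 expects one field back.
    doc = dict(doc)
    doc["counterparty"] = f'{doc.pop("counterparty_name")} ({doc.pop("counterparty_id")})'
    return doc

@downgrade_step(4)
def v4_to_v3(doc: dict) -> dict:
    # Pretend v4 stored amounts in minor units; v3 clients expect a decimal string.
    doc = dict(doc)
    doc["amount"] = f'{doc.pop("amount_minor") / 100:.2f}'
    return doc

def to_version(doc: dict, latest: int, requested: int) -> dict:
    """Walk the chain latest -> requested, one migration at a time."""
    for version in range(latest, requested, -1):
        doc = DOWNGRADES[version](doc)
    return doc

invoice_v5 = {"counterparty_name": "ACME GmbH", "counterparty_id": "c_17",
              "amount_minor": 12990}
print(to_version(invoice_v5, latest=5, requested=3))
# -> {'counterparty': 'ACME GmbH (c_17)', 'amount': '129.90'}
```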
[00:37:48] Unknown:
API versioning is a hard challenge no matter what space you're in, but having to maintain it for an indeterminate amount of time definitely adds a significant amount of burden, and probably technical debt, to ensure that you're able to maintain that backwards compatibility.
[00:38:04] Unknown:
Sure.
[00:38:05] Unknown:
Yeah, that's definitely a very interesting solution that you found, being able to maintain the latest representation of the data but then translate it backwards. So that's interesting. And you mentioned that you keep the latest and the prior version. Because you have people using some indeterminate version of the API in perpetuity, do you have to maintain multiple versions and multiple downgrade steps for a particular data representation?
[00:38:37] Unknown:
Yeah. Basically, it's pretty hard to jump from v5 back to v1, so probably it's a good option to do a couple of migrations in the middle. That is also very helpful when you're doing debugging, because you can see all the changes between different versions. Yes, it's still a pretty annoying process, because you have to migrate data representations.
[00:39:02] Unknown:
And as you continue to build and iterate on the product that you're building at Monite and invest in this Fintech sector, enabling other businesses to build financial applications on top of your platform, what are some of the things you have planned for the near to medium term, as far as your data capabilities or your platform architecture at Monite, or some of the problems and projects you need to dig into?
[00:39:27] Unknown:
We have more of a long term strategic plan, because we're dealing with quite a lot of transaction information coming from SMEs in different countries. We came up with our ultimate goal, our North Star: we're eventually planning to kill all financial documents and OCR in the fintech space. There is the SWIFT protocol for financial transactions made between different banks, and before that, people used the telegraph in order to make money transfers. But right now, we live in the 21st century, and people still print invoices on paper and deliver these invoices to each other via physical mail. Our plan is to completely kill that industry. Obviously, it's kind of ambitious, but we're gonna take baby steps towards that goal, and it's probably gonna take 5 years at least.
But the idea is to completely kill all these PDF documents and scans and replace them with machine-readable messages, and these messages should have a very strict structure inside. So instead of OCRing your PDF files, you'll get access to all your financial records instantly, without having to manually adjust and verify them. I guess it's something that will probably eliminate thousands of jobs, because someone still has to scan and print and deliver these papers. But I guess it's a good thing to do, because, well, it's the 21st century; we definitely can do better than a bunch of PDF files and paper envelopes delivered by a mailman.
That's what's in store for us. But on the way to that goal, we're gonna keep expanding into different markets. It's a pretty challenging process, because we're an infrastructure company and we have spikes of traffic. When we onboard a new customer, we may start serving, let's say, 100k new users overnight. Usually, when you're doing B2C software, you don't see these kinds of traffic spikes very often; usually, you see the number of users gradually increasing over time. But if you have a platform and your customer just starts importing their users into your infrastructure, then, yes, there are spikes. And dealing with the spikes is a pretty hard process, because, yes, we have plans for everything.
We have all the infrastructure challenges modeled, and we're prepared for everything, but reality is, however, a bit different. And, yes, that's a long process of our system's evolution towards that goal.
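Purely as an illustration of the "strictly structured, machine-readable message" idea from the roadmap above, here is what such an invoice message might look like. The schema and field names are invented for this sketch and do not correspond to any existing standard or to Monite's format.

```python
# Illustrative only: a strictly structured, machine-readable invoice message in
# place of a PDF. The schema and field names are invented for this sketch.
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class InvoiceLine:
    description: str
    quantity: int
    unit_price_minor: int  # minor currency units avoid float rounding

@dataclass(frozen=True)
class InvoiceMessage:
    invoice_id: str
    issuer_vat_id: str
    recipient_vat_id: str
    currency: str
    lines: tuple[InvoiceLine, ...]

    @property
    def total_minor(self) -> int:
        return sum(l.quantity * l.unit_price_minor for l in self.lines)

invoice = InvoiceMessage(
    invoice_id="INV-2023-0042",
    issuer_vat_id="DE123456789",
    recipient_vat_id="IT98765432109",
    currency="EUR",
    lines=(InvoiceLine("API subscription", 1, 49900),),
)

# The wire format is plain JSON: instantly parseable, no OCR required.
print(json.dumps({**asdict(invoice), "total_minor": invoice.total_minor}, indent=2))
```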
[00:42:26] Unknown:
Are there any other aspects of the work that you're doing at Monite, or the overall problem space of data management in a Fintech context, that we didn't discuss yet that you would like to cover before we close out the show?
[00:42:39] Unknown:
Yeah, there is one interesting topic. Currently, pretty much all governments across the globe are trying to introduce local standards for financial documents, and it seems like a nightmare, because even in the European Union, we now have an Italian invoicing standard, a Portuguese invoicing standard, and a German standard for financial documentation. So, basically, we have a zoo of data formats and standards, and some of the standards don't even have any explanation in English of how they work. So if you want to get connected with the Italian invoicing system, you have to go through a 200 or 300 page long document written in Italian explaining how the system works, and after that you have to give a call to the Italian authorities and say, look, we have a plan to get connected to your document exchange gateway.
So, basically, that's a big nightmare that is coming down on our heads, and I hope, like, in 5 to 10 years, people will abandon this idea, because right now it just multiplies the problems. Instead of just dealing with PDF files and financial records coming from different banking systems, we now have to deal with hundreds of different XMLs, JSONs, or binary formats. So it's just a very big challenge.
[00:44:13] Unknown:
Yeah. Is there at least any sort of common subset of information across those standards, or has every country decided on their own bespoke format that they are forcing people to comply with, so that you have to do a specific implementation for each one?
[00:44:32] Unknown:
So there is no single definite standard across different countries, and I guess governments don't have any intention even to start discussing these topics, because everyone says, okay, it's time to introduce our own standard and get some extra money from the taxpayers, because e-invoicing will give us better access to data and, hence, better access to taxes. So everybody is excited by the fact that they can get extra money from the taxpayers, but nobody is thinking about the consequences. So that's a problem.
[00:45:11] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:28] Unknown:
Well, definitely, in the Fintech space, we have to manually build all these data management solutions. As I said before, there are lots of restrictions and lots of rules that are enforced by local laws and regulations, and it would be absolutely awesome to see a database or data storage solution that would take care of all these things, because, roughly speaking, we spend, like, 25% of our tech team budget on dealing with this. Having a database that is already prepared to face all these Fintech challenges and that is already compliant with all these laws and regulations, that would be a game changer.
I guess it's not hard to build this technology from a technical perspective, but from the compliance perspective, it's a complete nightmare. Probably, in order to achieve that, someone needs to have, like, 10 skilled software engineers and 100 compliance managers and lawyers and product managers just to explain how it's supposed to work. I hope someone will create that solution in the future, because it's just so painful to deal with all these things manually.
[00:46:50] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your experience and perspectives on data management in Fintech and the work that you're doing at Monite. It's definitely a very interesting and important problem domain, so I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.
[00:47:16] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Data Challenges in Fintech
Key Capabilities Powered by Data
Regulatory Requirements and Trust Issues
Build vs. Buy Decisions in Fintech
Data Backup Strategies
Machine Learning Applications in Fintech
Data Governance in Fintech
Innovative Approaches to Data Management
Future Plans and Goals for Monite
Challenges with Financial Document Standards
Biggest Gaps in Data Management Tooling