Summary
There are extensive and valuable data sets available outside the bounds of your organization. Whether that data is public, paid, or scraped, it requires investment and upkeep to acquire and integrate with your systems. Crux was built to reduce the total cost of acquisition and ownership for integrating external data, offering a fully managed service for delivering those data assets in the manner that best suits your infrastructure. In this episode, Crux CTO Mark Etherington discusses the different costs involved in managing external data, how to think about the total return on investment for your data, and how the Crux platform is architected to reduce the toil involved in managing third party data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!
- Your host is Tobias Macey and today I’m interviewing Mark Etherington about Crux, a platform that helps organizations scale their most critical data delivery, operations, and transformation needs
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Crux is and the story behind it?
- What are the categories of information that organizations use external data sources for?
- What are the challenges and long-term costs related to integrating external data sources that are most often overlooked or underestimated?
- What are some of the primary risks involved in working with external data sources?
- How do you work with customers to help them understand the long-term costs associated with integrating various sources?
- How does that play into the broader conversation about assessing the value of a given data-set?
- Can you describe how you have architected the Crux platform?
- How have the design and goals of the platform changed or evolved since you started working on it?
- What are the design choices that have had the most significant impact on your ability to reduce operational complexity and maintenance overhead for the data you are working with?
- For teams who are relying on Crux to manage external data, what is involved in setting up the initial integration with your system?
- What are the steps to on-board new data sources?
- How do you manage data quality/data observability across your different data providers?
- What kinds of signals do you propagate to your customers to feed into their operational platforms?
- What are the most interesting, innovative, or unexpected ways that you have seen Crux used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Crux?
- When is Crux the wrong choice?
- What do you have planned for the future of Crux?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Crux
- Thomson Reuters
- Goldman Sachs
- JP Morgan
- Avro
- ESG == Environmental, Social, and Governance Data
- Selenium
- Google Cloud Platform
- Cadence
- Airflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Shipyard: ![Shipyard](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/v99MkWSB.png) Shipyard is an orchestration platform that helps data teams build out solid data operations from the get-go by connecting data tools and streamlining data workflows. Shipyard offers low-code templates that are configured using a visual interface, replacing the need to write code to build workflows while enabling engineers to get their work into production faster. If a solution can’t be built with existing templates, engineers can always automate scripts in the language of their choice to bring any internal or external process into their workflows. Observability and alerting are built into the Shipyard platform, ensuring that breakages are identified before being discovered downstream by business teams. With a high level of concurrency, scalability, and end-to-end encryption, Shipyard enables data teams to accomplish more without relying on other teams or worrying about infrastructure challenges, while also ensuring that business teams trust the data made available to them. Go to [dataengineeringpodcast.com/shipyard](https://www.dataengineeringpodcast.com/shipyard) to get started automating powerful workflows with their free developer plan today!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Mark Etherington about Crux, a platform that helps organizations scale their most critical data delivery, operations, and transformation needs. So, Mark, can you start by introducing yourself?
[00:01:40] Unknown:
Yeah. Sure. Mark Etherington. As Tobias said, I'm the CTO of Crux. Been there about 3 years now. Prior to that, spent a few years hiding out at Thomson Reuters and Refinitiv, in their global trading tech. And then prior to that, I've had a startup or 2. I've worked for Goldman Sachs, JPMorgan, and software houses along the way as well. And do you remember how you first got started working in data? I actually thought about this after you said you were gonna ask this 1 of us, and I think forever is the short answer. But the question is, like, what's data management? And it's 1 of the problems we have in the industry. Its definition is everything.
And data management is not very well defined for something that needs to be managed. So, I mean, I was looking back, and my very first job was taking real time streaming data, changing its schema, transforming it, loading it into a different system. And then I've looked through every application development job we've ever done, and, funny enough, none of them don't use data. So I think I've been there forever. I think, classically, things like market data, schemas, all this type of good stuff really came to the fore probably 10, 15 years ago, much more in the market data days, the old classic market data, high speed, low latency type of structures that we were using.
But also, coming down to earth, you know, moving data centers has appeared on my career list at some stage as well. And there's nothing that gets you in touch with large data like seeing all of those fantastic arrays that either you need to move into the cloud or you need to physically move between buildings. So I think the short answer is forever.
[00:03:15] Unknown:
And whenever you're dealing with actually physically moving the data around, you need to make sure you're bringing along a whole suite of Band Aids too. Yeah. You're not kidding. And maybe 2 or 3 copies along the way. So who knows? Awesome. And in terms of what you're building at Crux, can you describe a bit about what it is and some of the story behind how it came to be and why this is a problem that you wanted to spend your time on? It's actually our birthday today, so it's our 5th birthday, so great timing
[00:03:40] Unknown:
in terms of the interview. And I was reflecting on it, you know, I don't think it's really changed, to be honest. What we set out to do, many of us came from market data backgrounds or big data providers as well, and we saw a real inefficiency in the market in connecting suppliers to clients. And, frankly, suppliers thought they had the best way of doing things. All the clients saw was different. Neither of these 2 is wrong in terms of how they were positioning themselves, but there's a real opportunity in there. So what we wanted to do was build an ability to basically get data from anywhere, any format, any channel, any type of structure that the supplier wants to use, and make it really easy to get that data to consumers.
And for the consumers, making it easy for them is really about putting consistency around how you actually interact with the data. So you know things like: the last lot of data that was loaded was this, the 1 that I loaded last week was this, and it's consistent across all the different vendors. Now a lot of the data aggregators try to do part of this, but they also, frankly, put licensing constructs on top. We don't get involved in licensing. So we try to take the high ground here, which is very much that the licensing and the contracting between a data supplier and a data consumer should be between them. It's their financials. We offer a value added service in between, which allows us to get the data seamlessly between multiple vendors and multiple consumers. And there's a mutualization play there, which is great. And when you're doing that, then, obviously, you can start adding really great services in the middle. So you can start doing things like quality checks. You can do transformations, you know, schema evolution, the sort of things that we've seen for years in file formats like Avro. Certainly, you need to be able to do those en masse, at scale, in databases and the like as well. So we see a real, yeah, uptick in the ability to add real value in the middle as well as delivering the data from end to end.
And over the 5 years, it's largely panned out. I think the journey, as always, takes that little bit longer than you think it's going to. Thankful to be along for the ride. I've been here for 3 years. We're gonna get this 1 going even further, and it's really been fantastic to deal with the wide range of clients and consumers that we actually interact with, through, obviously, sales, but also getting them up and live and seeing real world engineering problems. And no matter how many real world engineering problems we all think we know, there's always more. Every client brings their own slightly unique flavor of pain to the table that we're trying to help with. I think it's interesting
[00:06:15] Unknown:
that you stay out of the sort of vendor agreement, contract negotiation aspect of it, and you're purely focused on just managing the workflows, helping with the integration piece. And I'm curious, I guess, given that fact, how you get involved in that overall conversation. Is it the situation where there's an organization, they have a contract with that data provider, and they say, okay, working with this provider is too challenging, and then they find Crux? Or is it something where they say, oh, we're already using Crux, I see that you have integrations with these different data providers, can you broker an introduction between us, and then we'll handle the contractual bits, and then you'll do the data?
[00:06:56] Unknown:
All of those. I mean, the reality is that you can't move in our industry without some form of paper. Yeah. Even if you're web scraping, you better check your t's and c's on that website before you use that data. Yeah? So the way we're structured, we have some supplier clients, and we serve both sides of the industry. Yeah? So we have some supplier clients that actually want us to run trials for consumers. So in those instances, getting a client onboarded, getting them up and running, getting the data delivered to them in the format they want, may not necessarily be something that the supplier couldn't do if they wanted to put time and energy into it. But often the supplier is focused on the data, not necessarily on the channel.
And more distribution channels and more transformations their client wants can actually mean, obviously, more costs and not necessarily more revenue from a data supplier's perspective. So in an ideal scenario in that mode, the supplier actually brings clients to us. On the other side of it, it may be a startup, whether it's in the financial industry or not, and they're looking for data, and they see effectively that we have a large inventory of data that's on the system. Now, when we say we have it, the reality is it's on the platform. That doesn't mean it's free, available, and just gonna be distributed by us. In those scenarios, the inventory is, if you like, the interest that the consumer has. But the very first thing we have to do is establish with the consumer legally who they are, and we will always go back to the supplier to say, this company over here, these individuals, they say they've got a contract with you. Yeah. Do they? So there's a compliance step. And if they have, you're good, yeah, and we can distribute your data to them. In some cases, that's not the case. In some cases, the supplier will say, no, actually, I wanna do that myself, or there'll be a different set of conversations. That's fairly rare, but they do happen.
The consumer is driving because they just wanna get access to the data, and they don't wanna deal with the bureaucracies, frankly, of dealing with 50 different suppliers. So from their perspective, they have got the contract. Once they've got the contract, they don't wanna have to go through the pain of dealing with physically setting up different FTP connections, whether they want Snowflake or they're in Google or Azure. They don't wanna have to go through that infrastructure pain. In fact, I don't think many of the infrastructure people wanna go through that pain either, but they often see us as a great way of getting speed. So we have a bunch of other types of datasets as well. So we do real time connections using APIs.
We do a lot of web scraping. Some of the big datasets out there from the government are on the platform as well. So those are there. To be fair, every time we've tried to think about it and say, do you want us to take a position? The answer is always sort of, well, yes, we would quite like it, but no. Because at the end of the day, everybody thinks they're the best negotiators. Not everybody likes the way that the aggregators actually sit in the middle of these processes, and they want to be able to do things the way that they want to do them. And with what we offer to consumers and suppliers, we act on behalf of 1 or the other. We don't, you know, intersect badly in the middle of any of this. So we're either acting as, you know, the agent for the consumer or the agent for the distribution, you know, source, which is the supplier. It's worked out pretty well, and it means we can, you know, sort of become the Switzerland of data, keep fairly neutral during the fun and games of the contract wars that always seem to start between various parties.
Does that make sense? Absolutely.
[00:10:33] Unknown:
In terms of the types of data sources and data providers that you're working with, I'm wondering if you can talk to some of the categories of information that organizations are looking to bring in and integrate with and maybe some of the use cases that they're looking to support by integrating those external data sources that aren't wholly owned by that organization.
[00:10:55] Unknown:
Yeah. Sure. I think if we separate the world out into, you know, 2 main pieces, 1 is getting the data, and then there's an internal piece of distributing the data within the company. We typically don't focus on the distribution within a company, although we will do distribution to multiple entities of a given organization. But the sourcing of data is actually pretty interesting. It's like, how long is a piece of string, to be blunt? We have bumped into pretty much every dataset on the planet, 1 way or the other. Now there are some broad categories. ESG, environmental, social, governance type of data, is a very big thing at the moment. A lot of regulation coming up, certainly in Europe, even in the US. You've got carbon pricing.
There's a lot of energy in that space, if you pardon the pun. Now, within that, it's still fairly nascent, to be honest. There are hundreds of datasets. Everyone is vying for position in terms of whether they have the best benchmark, or they wanna take somebody else's data and create a benchmark, or here's the raw data and clients wanna consume it. And these are things from zip code based assessments of environmental health issues, to zoning rules, to governance type of data being extracted from PDFs in the annual reports for companies. And that's just 1 small sector. If you look at the broad gamut, there's what you view as pure referential data, reference data for the banks, so you're looking at things like currency codes and country codes. All the classic ISOs are in there, through to entities, entity masters. So you're talking things like the global LEI. We've got the global LEI database loaded up. That's an interesting source of entity relationship information, but you've also got all the third party datasets as well. You've got FactSet, D&B, Bloomberg, Refinitiv. Everybody has their flavor of the entity masters.
Everybody has a flavor of security masters inside the financial industry. But you also look at what everybody calls alternate data. That's not so much an alternate format; alternate data is probably best described, from what I've seen, as basically nontraditional data. And that could be credit card receipts. It could be shop sales. It could be weather forecasts. There's a wide range of things that you wouldn't necessarily believe that financial institutions would want, that they will go after, basically, to get a hint or insight into what's going on in the marketplace to give them that small edge in how they want to position their investment thesis or their trading thesis on the back of it.
You've got others. Supply chains are a classic at the moment. Supply chain risk. Everybody's worried about it, you know, with COVID; everyone, I guess, has seen the container ships, you know, queuing up in all the major ports. That's 1 piece of this, which is logistics. And there are datasets out there about logistics and what ships are going where. The other side of that is who's actually shipping, and those shipping manifests, which are also able to be, you know, obtained. The number of lorries leaving mister Musk's factories with cars loaded on them is a dataset that's actually out there. But the other side of supply is risk, reputation risk. And this is where there's an interesting intersection we're seeing between ESG and supply chain.
Those are conversations which are: do I really know who I'm dealing with? I am a major food company. I have many suppliers. Are those suppliers ethical? When I say that my fish is line caught, how do I know that's true? How do I know that the actual owner of the company is not the person on the manifest of the company but is actually 2 stages removed, and there's actually a reputation risk there for me in the marketplace, that I'm not being, if you like, true to what I'm saying I'm doing, or I'm gonna be associated with a nefarious character of some form? So we're seeing an intersection in the supply chain between what was classically know your customer in the financial industry, know your supplier in the supply chain industry, and the ESG risks around dealing with those companies, whether that's, you know, cage free eggs all the way through to, you know, individuals with poor track records running organizations.
So a very wide gamut.
[00:15:20] Unknown:
As far as the challenges and complexities and risks that organizations encounter when they're trying to bring in those external data sources, manage them, and make them part of their production data systems, I'm curious what are, specifically, the primary risks that they're dealing with, but also some of the incidental complexities that come along with it.
[00:15:44] Unknown:
At the most basic level, it takes time. I mean, what we've seen is if we happen to have the right inventory on the platform that the customer wants, you just can't go fast enough. But the reason I raise that is that the other side of it shows you where they feel the pain. The initial trialing of datasets is like buying a car or a bike: rarely do you go without a test drive. So the start point of many of these processes is, is the data actually what I want? Am I willing to enter into a license for this data? We find that that process can be fraught with third parties.
Yeah. The suppliers' sales teams are very keen to get it there. They want new customers. They want to get that process running, but that can be quite hard. And the difficulty is, again, contracts, just making sure paperwork and lawyers are interacting correctly. And it's physical: obtaining connectivity, whether that's on the supplier's side or whether that's on the consumer side, going through the appropriate security protocols to get the data loaded. And then the real fun starts, which is, it's not quite in the way I wanted it. I need it shaped like this, or I need it enriched like this, or I need it combined like this in terms of the datasets themselves. So the risks are paramount: did you get the data you really thought you were getting? Is it gonna add value to you? How do you get through that trial process to defray that risk, through to physical issues?
And then once you've got that sorted out and you're getting the data, then you have, okay, how am I gonna manage change? When I look at the process, I guess I was a little bit naive when I started looking at this in detail 3 or 4 years ago in thinking that there was a different life cycle here than normal, if you like, development. I don't think there is. This is very much pipelines are code and config. Code is code. How often have we heard people say that, you know, config is code? That whole life cycle of developing a pipeline, and getting the scope right at the beginning, is what we're talking about, all the way through to, you know, a supplier's changed the schema and did not tell us.
Your pipeline is likely to go down. How do you manage that process of change? How do you know when files are late? Because if you care about the data, you're probably plumbing it into your system flows. Who's chasing that data for you? You've got to treat data as a first class citizen. And whether you do that internally, yeah, we have a service, or you do that externally, you have to do it. And you have to recognize that data needs a little bit more nurturing than just a 1 shot fire and forget at the beginning. And I think that's probably the biggest risk. I wouldn't say it's a lack of respect, but treating the fact that that's really important to you. Otherwise, why are you spending cash on it? And if you're spending cash on it, don't you wanna make sure you get the maximum amount out of it and that it's not going off the rails for you in whatever form that might be?
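To make the "who's chasing that data" point concrete, here is a minimal sketch, in Python, of the kind of freshness check implied above. The feed names and thresholds are hypothetical assumptions for illustration; this is not Crux's actual tooling.

```python
# Minimal sketch (not Crux's actual tooling) of a late-file check for external feeds.
from datetime import datetime, timedelta, timezone

# Hypothetical expectations per external feed: feed name -> maximum allowed age.
EXPECTED_FRESHNESS = {
    "vendor_a/eod_prices": timedelta(hours=6),
    "vendor_b/entity_master": timedelta(days=1),
}

def find_late_feeds(last_delivery_times: dict[str, datetime]) -> list[str]:
    """Return the feeds whose most recent delivery is missing or older than allowed."""
    now = datetime.now(timezone.utc)
    late = []
    for feed, max_age in EXPECTED_FRESHNESS.items():
        delivered_at = last_delivery_times.get(feed)
        if delivered_at is None or now - delivered_at > max_age:
            late.append(feed)
    return late

# In a real pipeline this list would feed alerting (email, Slack, paging) so that
# someone is chasing the late data before downstream jobs break.
```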
[00:18:51] Unknown:
As far as the costs that are associated with managing these data sources, it's definitely going to be a major concern, particularly as you scale the size of the organization and the amount of data that they're dealing with. And I'm wondering what are the primary sources of cost that get factored into actually integrating these data sources and working with them? Obviously, there's the contractual element, but I'm wondering how much of the overall cost of ownership is involved in that. And then what are some of the other ways that cost of ownership gets factored into that equation?
[00:19:26] Unknown:
I'll go back to the development analogy. I think it's very, very similar. We're gathering stats. We've got over 25,000 pipelines live. Yeah. That's a fairly big bunch. We're getting better and better at actually capturing our own metrics about how long does it take, what's the rate of change of schemas, and so forth as we go through it. But it's fair to say today, I think it follows pretty much to the letter what we see in the development cycle. Yeah. Effectively, the first pass development cost, however you wish to define that in terms of the data engineer, the connectivity being established, the physical infrastructure that's being stood up to actually run the work, or leveraging it, that whole piece is probably sub 20% of your total cost. Now your lifetime of data is tricky to estimate. Yeah? Some organizations will take data in and, you know, that dataset's dead in 6 months. I no longer want it. Others will use it for a very long time.
It's very difficult to put that parameter around it. From our experience, you take any data north of 6 months, you're likely to run into a change of some form. You're definitely likely to run into operational issues at least once or twice during that process. So you've got operational overheads. You've got change overheads, maintenance of schemas. You've got your own internal coordination. You've gotta run it on your own infrastructure. You've got to develop the code, or you've got to get a solution that either gives you a tool to have your pipelines, or you've got to invest in your own platform, which is time, money, cost again to do it. But you've got to get into a position where, you know, you understand the total life of this. And this is where we see a lot of people go wrong. The amount of conversations that I've had that go, well, it's a web scrape. Surely, I can just do that this afternoon. Just run up some Selenium and off I go. Okay?
Off you go. Okay? If you wanna do that, that's absolutely fine. But if you wanna be in a position where you trust that pipeline is gonna work, and when those websites change, which is the extreme end of this, that your pipeline's still gonna work, or you've got someone that's gonna reinvest in that process, then you have a lot that you have to actually take into account if you're trying to work out what the overall, you know, math of your equation is going to be. We find the conversation very fruitful when you actually are talking to practitioners. It's very much more of an education further up the stack. When you get to CDOs, they get it.
It is an education. It's not taken by default. I mean, I still run into people that think the software just grows on trees occasionally as well. That sort of thing has dissipated over time. With data, I think there's a lot to be learned by just looking at the cost structures around fundamentally, you know, development in its own right, which is what this is, to be blunt.
[00:22:15] Unknown:
The other interesting element of figuring out what is the cost of a given set of data is then figuring out, okay. Well, what is the actual value that it's going to produce? Which is definitely a conversation that's been long in the making where 10, 15 years ago, as Hadoop was on the rise, it was just throw everything in there. It'll be useful someday. And, you know, then people started to realize, oh, wait. This is actually costing us a lot of money, and I've yet to see anything really useful come out of it. And I'm wondering how that conversation around value of data and the ways that it's going to produce that value has started to evolve and how that gets brought into that conversation around what is the cost of acquiring this data, and is the cost benefit analysis worth it?
[00:22:57] Unknown:
It's really interesting. Okay? Because, yes, we deal with a lot of people in the industry at all sorts of levels, and there are some that are actually starting to think about data as a product. I think that's the start point, frankly, of the conversation, because if you don't treat it as an asset, then at the moment it's a liability. So we're talking about, yeah, what you're doing being a liability, and one that you're paying for. You know, if you do the math on the total cost of ownership, it's probably well north of 10,000, yeah, okay, over a couple of years, at least, even for the most basic dataset. Add that up over a couple of hundred datasets. You're talking real money, yeah, in terms of operations.
So unless you can do something with it, you are clearly carrying a liability. Now what I've seen on the other side is, like, okay, define it as a product, start treating it as a product, and that doesn't necessarily go down too well with the end consumers in organizations. You know, they're jumping up and down saying, give it to me now. Give it to me now. I can make more money with this if you're in a bank, or I can manage risk better, or, in a supply chain, I can manage better. So trying to put a monetary value on that data is extremely hard.
We've had 1 instance where we actually had a great conversation, yeah, where somebody actually knew the dataset they'd got, they had a thesis, they'd proven the thesis out with a trial of data from us, and they were able to predict that their revenue was gonna go up 1%, which is a significant amount of money. That's a brave person to say that, but they'd actually based their conversation on hardcore analytics. Try doing that on ISO currencies. Well, if the ISO currency file disappeared, you're probably not doing business. That's a risk. Do you wanna put a quantification around it?
This balancing act we see is between, like, a risk posture versus a genuine, you know, return on investment in the datasets themselves. If you start with the premise that data has to be ingested into your organization, you can then do the ROI of taking a service like ours, because you can literally just go through the list. And if we've got stuff in inventory, your time, it's hard to value time to market, but you can put a value on it if you want to. Yeah. With the rest of it, it's just a straight comparison of, you know, we've got engineers doing things that you probably want. Yeah. They do it once. You get charged a smaller amount than it costs us to do it. In terms of delivery, we take care of the operation for you. We also take care of capacity.
A lot of organizations don't have capacity. So we're talking about cost. Assuming they want the data, they need to have people to do it as well. It's great. We have some great tools. We're investing in our tools. But, honestly, you need people to operate the tools as well. And organizations which are scattered in terms of where their data is, they find it very difficult to actually even work out what the genuine cost is, let alone get to the point of the return. And some organizations have addressed this with organizational structure. They've made the businesses that are buying the data responsible for writing the check. It's when you've got big central teams that are trying to, you know, if you like, suck in the whole of the universe of data and then redistribute it, that the value judgment becomes a little bit more divorced from the end consumer. If you can try and link the end consumer directly to the bill, then at least they're culpable as part of that process.
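As a back-of-envelope illustration of the cost math discussed above, the rough figures from the conversation are build work at sub 20% of total cost, well north of 10,000 per dataset over a couple of years, and a couple of hundred datasets. The exact numbers below are assumed examples chosen to match those figures, not Crux data.

```python
# Back-of-envelope illustration only; the specific numbers are assumptions.
build_share = 0.20          # first-pass development as a share of total cost ("sub 20%")
tco_per_dataset = 12_000    # assumed total cost per dataset over ~2 years ("well north of 10,000")
dataset_count = 200         # "a couple of hundred datasets"

ongoing_cost_per_dataset = tco_per_dataset * (1 - build_share)
portfolio_total = tco_per_dataset * dataset_count

print(f"Ongoing (non-build) cost per dataset: ~${ongoing_cost_per_dataset:,.0f}")
print(f"Portfolio total over the period:      ~${portfolio_total:,.0f}")
```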
[00:26:31] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. And in terms of the platform that you've built to be able to help organizations amortize some of the cost and risk associated with bringing in these external datasets, maintaining them, managing them, and ensuring that they're not just going to break so that you have to reinvest your own money and time into, you know, repairing the web scraping or adjusting the schema resolution and transformations that have to go into that, I'm wondering if you can just talk to some of the overall design and implementation of how you've built Crux.
[00:27:57] Unknown:
It's been an interesting journey. Yeah? So I think we all start out with great intentions that we've got the right technology choices straight away, right out of the gate. Well, you know, we've maybe evolved a bit from where we were 3 or 4 years ago. So our base is we did invest in a platform. Okay? So what does that platform really give us? It gives us a way of accelerating, with confidence, the onboarding of data. So we have people. We have an onboarding team as well that uses our tools to do the work, but we have a Java based, what we call the PDK, which is the pipeline development kit. It's really a pipeline platform that all of our tooling uses.
We have a fairly sophisticated profiler that will connect into, you know, sites. It will do things like look at, you know, a simple file that's a zip, open it up, let you profile which files you actually want. We run into issues with third party data, like tens of thousands of files that span back over 5 years with different schemas over different periods. We've got technology that lets us actually open those up and auto define schemas and associate schemas with ranges of time and ranges of files that we actually load up into the core system. We're all based in GCP for our core processing.
So we make extensive use of Google storage. We make extensive use of the file distribution capability that Google has in terms of getting data to clients if they want to take files. When we process the data, we curate it. Now what that really means is we add the appropriate metadata around every data frame that we have. We make sure we track lineage in terms of ingestion dates and a whole mechanism to create a coherent feel for the data. Now we will produce those datasets that are coming from raw. We'll produce, effectively, Parquet, Avro, and CSV versions. They will be stored up in GCP, and we will also load those up into BigQuery. We'll also load Snowflake
if people want it there. We'll also deliver data directly into Azure, so we have a dispatch service. We separate the problem fairly straightforwardly into ingestion, so multiple adapters at a macro level. A number of the edges are done in Python. That goes on to the core processing pipeline, which is all Java. It runs on top of Cadence. We also have had Airflow in our history as well, but we've moved more towards Cadence. And then on the back end, we have our dispatch service, which actually does the adapters in reverse. It will do reliable delivery to endpoints of our choice.
We wrap our core with additional services. We have a fairly early version of this now, which is called schema protect. We're basically trying to put schema evolution in place so that we can actually do auto combination of tables into a view to actually divorce the end consumer, effectively, from a schema change that is a surprise 1 morning as we go through it. We have a quality subsystem that lets us run, effectively, statistics and checks across any column of data that we have in the system. And we have the ability to enrich and do cross referencing of the actual schemas themselves. So we can actually understand that an ISIN is an ISIN or a country code is a country code as well.
The journey the company has taken, as I said, it was our 5th birthday today, and probably the last 3 years have been the serious accelerant in terms of what we've been doing. We have changed, I guess, technology once in that process, fairly early in the beginning, and we're now into refinement mode. And our energy at the moment is going into better and better profiling tools, the ability to actually adapt really quickly, to connect to clients in a much more straightforward fashion, and multiple different deployment forms. Not everybody wants cloud, funny enough, or they want cloud, but they want their cloud as part of the process.
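As a rough illustration of the profiling step described above, opening a zip of historical files and associating schemas with ranges of files, here is a minimal Python sketch. The function names are hypothetical, and this is not the Crux profiler, which is described as Java based.

```python
# Minimal sketch of grouping files in a vendor zip archive by inferred schema.
import csv
import io
import zipfile
from collections import defaultdict

def infer_schema(csv_bytes: bytes) -> tuple[str, ...]:
    """Use the header row as a crude schema signature."""
    reader = csv.reader(io.StringIO(csv_bytes.decode("utf-8", errors="replace")))
    return tuple(next(reader, []))

def profile_zip(path: str) -> dict[tuple[str, ...], list[str]]:
    """Map each distinct schema signature to the files that carry it."""
    groups: dict[tuple[str, ...], list[str]] = defaultdict(list)
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if name.endswith(".csv"):
                groups[infer_schema(archive.read(name))].append(name)
    return dict(groups)

# Example: profile_zip("vendor_history.zip") might show that older files share one
# header layout and later files another, which is the "associate schemas with
# ranges of files" problem described in the conversation.
```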
[00:31:56] Unknown:
As far as the design choices that you have brought into the platform, I'm wondering what were some of the most difficult concepts to get right and some of the things that you had to struggle with to figure out how do I make this a maintainable and a scalable system that is still accessible to my end users? And, also, how have you thought about the compartmentalization and management of the operational complexity that's involved in being able to manage so many of these multi point to multi point connections?
[00:32:28] Unknown:
I guess if you separate them out, the biggest, if you are looking at it in terms of a learning lesson, and this drove technical design points, is that we're not a big data problem. And it was an interesting mindset issue, because the way that we viewed big data is, okay, you mentioned Hadoop. Yeah. So you throw an infinite amount of compute onto your MapReduce and see what comes out the back end. That's not the business I'm in. That might be the business some of my clients are in, but that's not the business we're in. We're in the business of moving and running, to your point about operational resiliency,
extremely large numbers of pipelines extremely well, extremely quickly, and with active resilience. As soon as we did that, it informed our tech choices. Yeah. You don't necessarily wanna run to Hadoop. Airflow's great. Nothing wrong with it. Okay? Airflow has matured dramatically over the last 3 years, but it wasn't really fit for what we were trying to do 3 years ago. We needed low latency between steps. Perhaps we didn't use the technology correctly. I mean, everyone assumes that they're the best developers on the best technology all the time. That's not necessarily true. If you're even close to, you know, the cutting edge, then the likelihood is you're gonna fall over a couple of times as you go through it. I think for us, going back to it, large scale distributed engineering is a core differentiator.
The fact that we are multi zone, effectively, that we have full DR, we understand. Our solutions span multiple verticals, and we have a lot of folks that understand the concept of a living will. These people care about their data. They care about what happens to your company. They care about disaster recovery. You can't really deal with an exchange or a big bank and just say, don't worry about it; if the site goes down, you don't really need us, do you? Not when you're actually trying to plumb into the lifeblood, the fabric around the technology of their process. You know, how do you manage security policy?
How do you make sure that you always pass your SOC 2 Type 2 audit as an operator? How do you make sure you don't, you know, effectively breach any entitlements for suppliers? Those things are really, really important and have to be factored into the system itself. Moving from big data to large scale distributed changed our viewpoint on how we were gonna architect the system. It also changed the economics. And this is where some of it comes in. You know, everyone thinks the cloud, as you mentioned a little bit earlier, Tobias, means that storage, you know, doesn't matter. It's not quite as clear cut. We have a lot of storage.
And frankly, yeah, we went through a purging process maybe 6, 9 months ago. And we found that, you know, we could save a reasonable amount of operational cost just by tweaking how we were doing things, how we were compressing things, how we were storing the metadata. Yeah. That whole process of constant reassessment is absolutely necessary, but it has to be done on a solid foundation. If you don't get the foundations right, if your API gateways are not working, if your datasets aren't being delivered, if the pipelines are going down, you know, you don't even get out of the gate. We've got some fairly demanding customers behind us, at least, in terms of that, and they definitely let us know in the early days that they weren't happy. And then when they are happy, you get silence. Yeah? But it's like you get to that point. Now operational insight, you raised that, is an interesting point. 1 of the core differentiators at the beginning was just what we call the notification service.
So when data's ready, hey, here's a webhook. Here's a Pub/Sub message. Here's an SQS message. Here's a row in a database table. Data's there. Now, you pays your money and takes your choice, yeah, in terms of how you want to take that data, but it's pretty important because people wanna be triggered. They don't wanna poll. So we found that was a very big thing for us. A full blown schema repository, extremely important. You know, it's not that we didn't have schemas, but putting it into a fully managed service, exposing an API around it, clients are almost as interested in the documentation, almost as interested in the schemas as they are in the data. That's obviously not quite true, but it depends on what part of the life cycle you're in. So there's definitely, you know, schema, schema, schema is king of the hill.
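As a sketch of the notification-over-polling pattern described above, here is a tiny Python webhook receiver that reacts when a delivery is announced. The payload fields (dataset_id, delivery_uri) are assumptions for illustration, not Crux's actual notification schema.

```python
# Minimal "data's ready" webhook receiver; payload fields are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class DataReadyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event = json.loads(body or "{}")
        dataset_id = event.get("dataset_id")      # assumed field name
        delivery_uri = event.get("delivery_uri")  # assumed field name
        # Trigger the downstream load here instead of polling for new files.
        print(f"Data ready: {dataset_id} at {delivery_uri}")
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), DataReadyHandler).serve_forever()
```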
And then I guess the last 1 I'd chuck onto this is, you know, technology. It's a bit RTFM: documentation. There are so many different files that you pick up. I think it used to be a point of pride, you used to get the badge, you know, as a technologist, for not reading the manual. Well, you sorta have to do that with data, because sometimes even the names of the fields aren't in the files. Yeah? They're actually externalized. So, yeah, a lot of energy in terms of how do we make real sense of that. We've already got NLP processing and ML for doing entity matching across the returned datasets. We've already built that. And now we're starting to actually look at how do we use those techniques to really get into the documentation that comes from suppliers and make that whole process of onboarding even easier.
Yeah? Pointing Skynet at a documentation file and, magically, the pipeline appears is a pipe dream. But certainly, pieces of that can definitely be automated as we go. Another interesting element of the situation that you're in is that you have all of these
[00:37:56] Unknown:
upstream data providers that you need to integrate with. You need to bring that into your system, and then you also need to be able to integrate with all of your customer systems. I'm wondering what your engagement looks like as far as onboarding a new customer, helping them get set up. Obviously, you have multiple different ways of being able to populate that data. But what are some of the mechanisms that you've built to be able to simplify that integration point both from the customer's view, but also from your perspective where you say, these are the ways that we support to integrate. And, you know, obviously, you don't wanna get into the situation of having to create a custom solution for every single customer that comes along because then you're not gonna make any forward progress on your business. You're just gonna be spending all of your time maintaining these multiple bespoke solutions. And just how you've thought about what is the common denominator, what are the supported interfaces that you are willing to maintain, and how you have that conversation with your end users to say, no, we're not going to build something special for you just because you have some system that is, you know, horrific and should not see the light of day, but that's what you're running. And just helping people maybe add new capabilities to be able to use your datasets if they don't already have that as part of their platform.
[00:39:07] Unknown:
Most consumers turn up with a very simple request: could you please make sure, you know, my bucket is populated with the files? How do you do that? I mean, it sounds stupid, but I wish that we were in, like, a machine learning, AI assisted world. The bitter reality, as most data engineers know, is that you live in a world made up, in large part, of files. We're probably the biggest, you know, shipper of files around at the moment. And from most clients when they turn up, even some of the biggest, you get the request for sharing, database sharing, and that is a really interesting tech, although it comes with some operational trust issues that haven't really been sorted out in the industry yet. Like, if I'm sharing data, what stops you deleting it? I promise not to delete it probably isn't gonna, you know, cut much. So many people are doing, like, database materialization, and love it because it's very easy to get into the organization. I wonder how much of that is about bypassing an information security review of some form more than it is actually about the technology.
But on the other side, they typically materialize that dataset pretty quickly. Now, I don't wanna be de minimis and just say FTP, but it's that sort of FTP. The reality is when people turn up, they're going to the cloud, but they're going, you know, to an S3 structure, an Azure blob structure, GCS buckets; typically, the request is that. Now when we started out, in fact, this is probably about 4 years ago, everybody thought that actually API, API, API. Yeah. You talk to anyone. It's just the API. It's not. I mean, you can do it, and we built a very large API infrastructure for grabbing files that we set on top of Google. So there's not really a capacity issue for streaming files down.
But that mechanism is quite limiting because, really, what you're doing is you're waiting for, you know, a notification that a file's ready, then you'll pull it down onto the client's infrastructure. What we found was actually just, you know, take the standard files and push them. That's typically what we get asked for. It is changing a bit. But, frankly, our investment back then was, okay, look, files are files. Let's just build a bunch of, you know, endpoint dispatch adapters. Just make sure they're completely reliable, they fail over, so we can get data to any of the clouds. And we were expecting to see a, oh, we want this, and we want this, and we want this. It hasn't appeared.
There have been a couple of edge cases where, like, third party applications that take data and then service their clients have wanted some bespoke file formats. But CSV, I hate to say it, I prefer Parquet and Avro myself, but CSV is still pretty much up there. If you're looking at GitHub, you know, data distribution stats of the year, you're going to find CSVs very high up. You know, database sharing, yeah, we did that. Been there, done that. Everything else, whether you wanna feed Databricks, you know, you wanna go into Athena, you wanna do Redshift, you name it, they all fundamentally come down to the file level, because that's where, funny enough, clients have kept their data. So you're either in a full blown locked down proprietary DB, an Oracle or Sybase, you know, you name it, or you're in files.
So we've actually found that, architecturally, I think we've got the right model, because we plumbed a couple of extras in, but we haven't really seen a desperate demand for that, if you like, last mile transformation. We do see demand for transforming the data, not the channel. So in that piece, we've just taken an industry term and used it ourselves, which is wrangle. So we provide an ability to transform data with complex rules, add datasets together, split them, slice them, dice them, you name it. And what we do is we produce another data product, and then that data product is distributed over a pipeline. So just like we do pipelines in, we treat that as a pipeline out. So it's a little bit of the snake eating the snake type approach for us. So we use exactly the same technology internally as we do with our clients, and we have seen that. There's an extreme end of that, which is like the last 10 feet: if you think about getting into an organization, and then, within that organization, getting out actually to the desks. We're not playing in that space as much at the moment, but we've got a certain amount on our books to allow micro customization of end transformations from a client's perspective.
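To illustrate the "wrangle" idea described above, combining delivered datasets into a new derived data product that then goes out over its own pipeline, here is a minimal Python sketch. The column names are hypothetical, and this is not Crux's wrangle service.

```python
# Minimal sketch of a wrangle step: join two delivered datasets into a derived product.
import pandas as pd

def wrangle(prices: pd.DataFrame, entities: pd.DataFrame) -> pd.DataFrame:
    """Join a price feed to an entity master and keep a trimmed set of columns."""
    combined = prices.merge(entities, on="entity_id", how="left")
    return combined[["entity_id", "entity_name", "as_of_date", "close_price"]]

# The output would be registered as its own data product and delivered the same
# way as any ingested dataset ("pipelines in, pipelines out").
```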
[00:43:48] Unknown:
Watch this space on that 1. Another thing that you brought up earlier is the question of data quality, data monitoring. I'm wondering what are some of the ways that you've invested in that to ensure that, as you're delivering all these datasets to your customers, the expectations are met from your side, but also that you're able to signal to the customers, these are all the validations that we've done, you can take that information and feed that into your system to make sure that as you're ingesting it, you know that we've done our job. There's an implication that it is my job in that conversation.
[00:44:23] Unknown:
Yeah. So it's very interesting. Your view of quality is different than mine. I guarantee it. I will guarantee it's different than somebody else's as well. Yeah. I mean, I always laugh about this. I'm a big science fiction fan. Trust me, 1950s B movies certainly hit my quality point. They don't hit my wife's when we're watching movies. And that's sort of the base problem that we have with data quality. The approach we've taken is try not to be arrogant about this. We don't necessarily know best. We definitely have a view. So the way we do this is: as long as the data that's being loaded on a daily, minute, hourly, yearly basis from the suppliers is fit for purpose, and I mean basic fit for purpose.
It loads. It's not egregious. It's not crashing. That's the minimum that we will supply to clients, and we have clients that want that. They want it processed, wrapped, but close to raw. That's the best we can do on that side. So we don't manipulate the data values at all, ever. We're always in a position where we can deliver it, which means it has to pass technical criteria, but it doesn't have to pass semantic criteria. The semantic system that we put on the back of this, we use ourselves to check, you know, when we're assessing data, when we're onboarding, but we don't actually publish that assessment. There's a conversation we have on a fairly regular basis about us setting a standard for each 1 of the data frames that we load.
We're a long way into that conversation, so it's almost like a home buyer's report for data. It's certainly a concept that we've looked at. What we're seeing from the clients is the opposite, though. They sort of want the home buyer's report, but then they want their own surveyor to go in and check the house out. So the way we've done the tooling is to allow a very sophisticated and effective set of queries to be run across any of the datasets we're loading, against any delivery as it happens, and to actually go through and instantiate the rules that the client wants.
But then the threshold of pain, the notification pain, is with the client. So we've explored whether we should do the escalation for this, operational services to go back to the supplier. It's next to impossible, we think, to agree an SLA for that service with, let me phrase it, multiple end clients. I could do it internally if I was within a company. I could maybe do it if I wanted to, you know, throw money at it for bodies to do it. But to automate that and provide a workable estimate that met the business criticality, that's really, really hard.
So we've gone for, you know, tool the client and let the client know, provide an easy way for the client to make the client's decision on whether that data should be used or not used as we go through this. It's sort of an underlying philosophy to enable the client. And, yeah, certainly when it comes down to things like this, or creating security masters, there's not a lack of knowledge and not a lack of skill with the team. It's a lack, or a distinct fright, of what might happen if something went wrong with that if you're not within the organization that actually is qualifying that data. Sorry, that's a bit of a long answer, but it's an interesting topic about whose responsibility it is for this data. Absolutely. No. It's definitely an interesting area to explore.
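As a minimal sketch of the client-defined quality rules described above, checks instantiated per delivery with the pass/fail decision left to the client, here is an illustrative Python snippet. The rule names, columns, and thresholds are hypothetical.

```python
# Minimal sketch of client-defined quality rules run against each delivery.
import pandas as pd

RULES = {
    "no_null_isin": lambda df: df["isin"].notna().all(),
    "price_positive": lambda df: (df["close_price"] > 0).all(),
    "row_count_sane": lambda df: len(df) > 1_000,
}

def run_quality_checks(delivery: pd.DataFrame) -> dict[str, bool]:
    """Return pass/fail per rule; the client decides whether to use the data."""
    return {name: bool(rule(delivery)) for name, rule in RULES.items()}
```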
[00:47:43] Unknown:
And so in terms of the ways that you're seeing customers use Crux and the types of data that they're bringing in, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen your platform employed?
[00:47:57] Unknown:
I'll start with gratifying rather than interesting. We've been party to migrations, whether it's technology migrations like Hadoop to Snowflake, or proprietary databases to BigQuery. Being part of that, acting as a sort of hub, a non-marketplace hub, is probably the easiest way to describe it. We don't sell data, as we talked about earlier, but acting as a multiplexer gives us a very interesting point in the flows. And when you have a lot of inventory and you talk to a lot of suppliers, and we have over 250 suppliers we have good relationships with, you start to get the critical mass to make that speed to market really pay off.
So certainly technology migrations and cloud migrations. What are the most innovative I've seen? Bizarrely, one of the more innovative was a client trying to get to grips with costs, forcing a data engineering exercise with us to really understand the cost of incremental data so they could go back to their business. It's not a technically innovative thing, but it's commercially quite an interesting approach for that client to have taken. We've got others that are moving thousands of pipelines, which is more of a general replatforming for the cloud in preparation for a massive expansion of what their clients are going to want to do in the cloud as well. There's the edge case where a client encoded data into a completely different format on the way out, as an adapter, to make a risk management system work extremely well. So there have been a number of edge cases. But the mainstay, honestly, is that people just want their pipelines and their data to be delivered. They want it reliably there. They want it 24/7. They want somebody else to care about it and to chase down the supplier.
And when it breaks, could you please fix it, and please try not to break it too much? There's a certain desire not to have to scale their own internal resources to do that piece of the puzzle. And it's really hard, at the end of the day, to scale this business.
[00:50:07] Unknown:
To your point about the situation where a particular data feed breaks, one of the typical approaches to managing that, at least in the batch context, is to run a backfill operation. I'm wondering how you approach that situation with your end customers, where you say, okay, there was downtime for this data source for, say, two hours, so now I'm going to backfill your bucket for those two hours, and how you manage that operation in general, particularly given the variability in the ways that the data is being delivered to you from those source systems?
[00:51:32] Unknown:
The good news is that within a given feed, there isn't much variability. There might be a gap, but it's a consistent approach, and we basically treat it as a full replay. Effectively, most suppliers keep a sensible amount of the data they were supposed to have shipped available for you to get. Now there are edge cases, not necessarily that edgy, such as APIs or exposed databases where you need to pull the data down. But mostly we will replay the files; we don't do replacement. And if you look at the whole of the chain, failure is obviously possible across multiple parts of it.
Assuming that we get the data, the files are safe and sound, frankly, sitting in GCS for us. If there's something really weird about that data, then those files are going to pile up while we work out what the hell was going on. It could be that the supplier changed schemas, for instance, and suddenly the code is no longer happy with what it's seeing. In that case it's treated as a major issue for us. Incident management and process kick in, and our data engineering team will look at that feed specifically and urgently, not just from an infrastructure perspective but from a data content and engineering perspective, to get it up and running.
In those instances, the process will kick off and replay those files in order. We work on an additive basis. We're not trying, at first-level ingestion, to give you deltas; that comes later. We can do it, but it's not part of that first step, and frankly it's one of the easiest mistakes to make in this business: trying to do too much processing on the data coming into the system. Honestly, been there, done that, don't do that. Get the data in, keep it as close to raw as you can, get it manageable as fast as you can, and then do post-processing as secondary steps. It gives you a lot more flexibility, certainly in these scenarios.
So we will load the data up, replay it, and put it in. Backfills are also there for significant onboards. We've had people asking for 20 years of back history where the supplier actually has those files going back 10, 15, 20 years. We will download those, do the profiling, and run them through. From our perspective, that's normal business: we'll get that data, play it through the pipelines, get it up into the main system, and deliver it off to you. If someone wants to do a massive exercise to dump petabytes of data, we'll support that, and it will probably be orchestrated with them.
But we aren't built for streaming APIs where gaps matter in real time from a financial trading perspective. That's not what we designed for. If you want to do that, go buy RMDS or the appropriate system, and then you're never going to fill your gaps anyway.
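As a rough illustration of that replay model: raw delivery files stay exactly as received in object storage (GCS in Crux's case), and a backfill simply re-runs them through ingestion in order, additively, with deltas left to later post-processing. The bucket layout, naming convention, and ingest_file hook below are hypothetical assumptions, a sketch rather than Crux's actual pipeline code.

```python
# Hypothetical sketch of an ordered, additive replay of raw delivery files.
# The bucket name, prefix layout, and ingest_file() hook are illustrative.
from google.cloud import storage

def replay_feed(bucket_name: str, feed_prefix: str, ingest_file) -> int:
    """Re-run every raw file for a feed through ingestion, oldest first.

    Raw files are never rewritten or replaced; ingestion is additive, and
    any delta computation happens in later post-processing steps.
    """
    client = storage.Client()
    blobs = client.list_blobs(bucket_name, prefix=feed_prefix)
    # Assumes object names embed a sortable date, e.g. feeds/acme/2022-05-01.csv
    ordered = sorted(blobs, key=lambda blob: blob.name)
    for blob in ordered:
        raw_bytes = blob.download_as_bytes()
        ingest_file(blob.name, raw_bytes)  # assumed to be idempotent and append-only
    return len(ordered)

# Example usage with a stand-in ingest function:
if __name__ == "__main__":
    count = replay_feed(
        bucket_name="example-raw-deliveries",
        feed_prefix="feeds/acme/",
        ingest_file=lambda name, data: print(f"replaying {name} ({len(data)} bytes)"),
    )
    print(f"replayed {count} files")
```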
[00:54:25] Unknown:
And in your experience of working on the Crux platform and working with your customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:54:35] Unknown:
I keep hoping, but I wish I could say this is different from any other part of computing, and it's not. You could almost ask: interesting, innovative, unexpected, or just frankly disappointing? The space is really, really interesting, but it is data processing. We sometimes wrap this up in special terms, and frankly I find that a little annoying, because it just adds another layer of complexity. There are very specific tasks you need to do with pipelines, and there are some really interesting things you can do, and the industry is maturing in its language, which I really like. But the reality is that this is largely large-scale distributed engineering.
I'm very fortunate to have extremely good infrastructure folks on staff, and the cloud doesn't make their skills go away; you just need a different set of those skills. There are plenty of ways of getting latency into your system if you do not know what you're doing with Airflow and GCS, for instance. So I think keeping track of your roots, making sure you understand how your system really works and the dynamics around it, is how you can maybe become a bit more innovative. There's nothing really new here in the overall space; it's how you've built your systems and your platform to support it. I view data pipelines largely as one very big application that needs an awful lot of care and attention on an ongoing basis to make sure it really aids the users.
And the users are typically data scientists at the end of the stream, but clearly in the middle there's a bunch of data engineers that need to be able to do their jobs really well. So no, I don't think there's anything particularly new. There might be new technologies, but there's nothing new per se in the process, from my perspective.
[00:56:28] Unknown:
Absolutely. So for individuals or organizations who are looking to work with external datasets, what are the cases where Crux is the wrong choice and they're better suited to building those integrations themselves, or maybe working with a data provider directly to manage those integrations?
[00:56:47] Unknown:
Right now, at this point in time, connecting to systems inside your own firewall is likely to be challenging for any company trying to do this type of thing. If you're trying to move systems from A to B and you effectively have an edict that the data can't leave your firm, then clearly we're outside your firm, so that's not a very good fit. But I also think it's a question of scale. Look, no one is too small to use the system, but there is a minimum entry overhead, a cost that you have to go through. We've seen a lot of success with clients starting small and expanding fairly dramatically in terms of their usage. So we don't turn away business on that basis, but we aren't the "sit your Selenium developer in the corner, have them crank something out, and put it live tomorrow" option.
We're the people that will maybe take a little bit longer than that, but we'll be in a position where that feed is going to work for the next 10 years, or we're going to have people on it as part of that process. So some of this is really mindset and where you want to be. If you don't want to invest in the platform, we're a very good choice. If you've got a lot of sensitive stuff that you believe is really your special sauce, and to be honest that's fairly few and far between, then maybe that's your rationale. Or maybe if you don't really know what you want. But these aren't things specific to us per se; more generally, if you're going to do business with a third party, you need to understand what you're willing to do and what you're not willing to do. The core of it, though, would be access to their internal data systems at this point in time. As you continue to iterate on the Crux platform and work with your customers and the data providers, what are some of the things you have planned for the near to medium term?
There's a significant amount of investment internally in the tooling and in making that tooling more broadly available; that's where a lot of the effort is at the moment. Specifically within that, we have a pretty great profiler for connecting into suppliers and working through their data, and there's a lot you can do in that space that would make it really incredible. A lot of it is what I mentioned briefly earlier: the ability to bring great technologies together to open files, look at files, profile files and schemas across multiple years, and link that with the ability to really derive understanding from documentation. I think that's going to be a massive differentiator for us as well.
We want to be in a position where building pipelines is really easy. I mean really easy. My tongue was in my cheek when I said, point the tool at the documentation and your pipeline appears, but I sort of want that, and I can see a path to doing it. Years ago, people were talking about how trading systems, or rather traders, were going to get replaced by machines. Some of it happened, but a lot of what actually came into existence was trader-assist technology that allowed them to do their jobs better.
I view us similarly. If you want to use us to build your pipelines completely, that's great. If you want to use our tech to speed up your data engineers, that's a fruitful road with us as well, to be honest. I think we have differentiating tech right now, and if we can plumb that in with the ability to understand documentation, recommend schemas, and automatically create data dictionaries, cross-references, and data curation, that's where I think we'd be leaps and bounds ahead of what we're seeing in other places. It's a bit of a hobby horse of mine. I also think we need something in the ML space that's specifically for us, versus what we do for our clients.
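To make the profiling idea concrete, here is a small sketch of the kind of schema inference a file profiler might perform, sampling a delivered CSV and recommending column types. The function names and heuristics are hypothetical illustrations, not the actual Crux profiler.

```python
# Hypothetical sketch of schema inference over a sample of a delivered CSV file.
import csv
import io
from datetime import datetime

def infer_column_type(values: list[str]) -> str:
    """Guess a column type from sampled string values (very rough heuristics)."""
    def all_parse(parser) -> bool:
        try:
            for value in values:
                if value != "":
                    parser(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"

def profile_csv(text: str, sample_rows: int = 1000) -> dict[str, str]:
    """Return a recommended {column: type} mapping from the first N rows."""
    reader = csv.DictReader(io.StringIO(text))
    rows = [row for _, row in zip(range(sample_rows), reader)]
    return {
        column: infer_column_type([row[column] for row in rows])
        for column in (reader.fieldnames or [])
    }

sample = "symbol,trade_date,price\nABC,2022-05-01,10.5\nXYZ,2022-05-02,99\n"
print(profile_csv(sample))  # {'symbol': 'string', 'trade_date': 'date', 'price': 'float'}
```

A real profiler would presumably go further, working across many files and years of history and feeding the recommendations into data dictionaries and documentation, but the core loop of sample, infer, and recommend is the same.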
[01:00:43] Unknown:
And one thing I was just realizing we didn't really dig into too much is what's involved in your process of identifying, assessing, and integrating with different data providers, and the onboarding process of bringing them onto your platform.
[01:00:59] Unknown:
That's actually fairly easy for us. Again, most of these things don't start with technology; they start with a philosophy. And the philosophy here is really simple: we don't want a supplier to have the ability, or the desire, to say no. The way you get around that is to make it as easy as you possibly can. Whether we've gone independently to a supplier and said, listen, we think this dataset would be really good, or we've been asked by a consumer to onboard data for them, which is also very common, we turn up and say, listen, give us the data the way that you want to give the data.
We'll take care of the rest. Largely, if we go back to the beginning of this conversation, how many flavors can there really be? You can guarantee that if a small data startup is doing something, it's probably going to be a file, or maybe an API that you need to use WebSockets with. That stuff isn't scary, but it's different from saying, I need you to code this API so you can feed me the data to give to the client, or I need you to transform it like this and give it to me in that special file format. We remove all of that; we sit here and take the pain of it. In some cases there may be esoteric feeds where we'll have to write a custom adapter in the short term, but we'd rather do that than have the supplier take the pain, and we believe it increases the velocity for our consumers rather than taking the high ground. So our approach to suppliers is typically very straightforward: turn up and do it. It's really the data nuances that become more problematic for us, not the connections. Within that, our data engineering team will read the documentation, liaise back and forth with the supplier if anything is unclear, and generally do the activities you'd expect a data engineering group to do when they don't understand the data. On formats, it depends.
There are some cases where, as I said, we try to keep the data we take in as close as possible to the raw form, but there are clients out there for whom, frankly, it makes good sense not to want that. Silly things, like: could you please pivot the data? Three years ago, we would have taken the file in and the pipeline would have pivoted the data. Now we take the file in as it is and then run, effectively, a transformation exercise on the back end to pivot the data on the way out for the client, because we believe that gives us the maximum impact for that client, and for multiple clients if others want the same thing.
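As a toy example of that "transform on the way out" pattern: the raw delivery is stored untouched, and a pivot is applied only when producing a client-specific output. The column names and the pandas-based approach are assumptions for illustration, not Crux's actual transformation layer.

```python
# Hypothetical sketch: ingest raw, pivot only for the client-facing output.
import pandas as pd

# Raw delivery, stored exactly as received (long/narrow format).
raw = pd.DataFrame({
    "trade_date": ["2022-05-01", "2022-05-01", "2022-05-02"],
    "symbol": ["ABC", "XYZ", "ABC"],
    "price": [10.5, 99.0, 10.7],
})

def pivot_for_client(df: pd.DataFrame) -> pd.DataFrame:
    """Client-specific output transform; the raw data is never modified."""
    return df.pivot(index="trade_date", columns="symbol", values="price")

print(pivot_for_client(raw))
```

Keeping the transform on the outbound side means the same raw files can serve other clients, pivoted differently or not at all.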
[01:03:37] Unknown:
Are there any other aspects of the work that you're doing at Crux, the overall space of integrating external data sources, or helping customers understand and amortize the overall cost of data ownership that we didn't discuss yet that you'd like to cover before we close out the show?
[01:03:52] Unknown:
One topic coming up a lot, certainly with CEOs, is describing data. We've talked a lot about the mechanics of bringing data through the pipelines, which is critical, but that earlier point about trialing data and making it easy to integrate largely comes down to knowledge. So we're starting to see things like FINOS launching the Legend project, which is effectively a meta layer for describing data. Initiatives in the industry to describe schemas and what's actually in the datasets themselves are, I think, very important for all of us to get behind, because they will start to give us a richer way of integrating data across the board, rather than using humans, or machines, to try to deduce things. It will be a lot easier if both sides have a clear expectation, dare I say a contract, of what the data actually looks like, described in that way.
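Here is a minimal sketch of what such a machine-readable description of a dataset, the kind of "contract" being discussed, might look like, assuming a deliberately simple homegrown format rather than FINOS Legend's actual modeling language. The field names and types are illustrative.

```python
# Hypothetical sketch of a minimal machine-readable data contract and a check
# that a delivered record matches it. Field names and types are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    dtype: type
    nullable: bool = False

# Both supplier and consumer could agree on a description like this up front.
CONTRACT = [
    Field("symbol", str),
    Field("trade_date", str),
    Field("price", float),
    Field("volume", int, nullable=True),
]

def violations(record: dict, contract: list[Field]) -> list[str]:
    """Return human-readable mismatches between a record and the contract."""
    problems = []
    for field in contract:
        value = record.get(field.name)
        if value is None:
            if not field.nullable:
                problems.append(f"missing required field {field.name!r}")
        elif not isinstance(value, field.dtype):
            problems.append(
                f"{field.name!r} expected {field.dtype.__name__}, got {type(value).__name__}"
            )
    return problems

print(violations({"symbol": "ABC", "trade_date": "2022-05-01", "price": "10.5"}, CONTRACT))
# ["'price' expected float, got str"]
```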
[01:04:41] Unknown:
And we're certainly seeing a rising amount of interest in that in the industry as well. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:05:06] Unknown:
I think there are two, though I say two because I'm not sure about the first. One is genuine lineage all the way through; on a good day I agree with that, and on a bad day I think people are just saying they want it and they don't use it. The second is absolutely the ability to deduce from documentation what is going on, or what a pipeline is going to look like, and to get real knowledge out of that. And that intersects with the ways of describing data: if you could describe data, you could describe pipelines, and you could also generate pipelines really easily.
[01:05:38] Unknown:
So that whole thing, I think, today starts, unfortunately, back in documentation. So documentation parsing for genuine semantic extraction would be where I'd put it. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Crux and your perspectives on how to understand the cost and value of data for people who are looking to bring in data sources that they don't wholly own, and the overall complexities of dealing with these point-to-point solutions between external data providers and the customers that you're working with. It's definitely a very interesting problem space, so I appreciate all the time and energy that you're putting in to make it a more tractable problem, and I hope you enjoy the rest of your day. Thank you very much. It's been a pleasure. Thank you for listening. Don't forget to check out our other shows. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers!
Introduction to Crux with Mark Etherington
The Origin and Mission of Crux
Challenges in Data Management
Types of Data Sources and Use Cases
Cost of Ownership in Data Management
Design and Implementation of Crux
Customer Integration and Data Quality
Innovative Uses of Crux
Lessons Learned and Future Plans
Onboarding Data Providers
Describing Data and Industry Standards