Summary
A huge amount of effort goes into modeling and shaping data to make it available for analytical purposes. This is often due to the need to simplify the final queries so that they are performant for visualization or limited exploration. In order to cut down the level of effort involved in making data usable, Matthew Halliday and his co-founders created Incorta as an end-to-end, in-memory analytical engine that removes barriers to insights on your data. In this episode he explains how the system works, the use cases that it empowers, and how you can start using it for your own analytics today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit http://www.dataengineeringpodcast.com/montecarlo?utm_source=rss&utm_medium=rss to learn more.
- Your host is Tobias Macey and today I’m interviewing Matthew Halliday about Incorta, an in-memory, unified data and analytics platform as a service
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Incorta is and the story behind it?
- What are the use cases and customers that you are focused on?
- How does that focus inform the design and priorities of functionality in the product?
- What are the technologies and workflows that Incorta might replace?
- What are the systems and services that it is intended to integrate with and extend?
- Can you describe how Incorta is implemented?
- What are the core technological decisions that were necessary to make the product successful?
- How have the design and goals of the system changed and evolved since you started working on it?
- Can you describe the workflow for building an end-to-end analysis using Incorta?
- What are some of the new capabilities or use cases that Incorta enables which are impractical or intractable with other combinations of tools in the ecosystem?
- How do the features of Incorta influence the approach that teams take for data modeling?
- What are the points of collaboration and overlap between organizational roles while using Incorta?
- What are the most interesting, innovative, or unexpected ways that you have seen Incorta used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Incorta?
- When is Incorta the wrong choice?
- What do you have planned for the future of Incorta?
Contact Info
- @layereddelay on Twitter
- Website
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Incorta
- 3rd Normal Form
- Parquet
- Delta Lake
- Iceberg
- PrestoDB
- PySpark
- Dataiku
- Angular
- React
- Apache ECharts
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Matthew Halliday about Incorta, an in-memory, unified data and analytics platform as a service. So, Matthew, can you start by introducing yourself? Yeah. Thanks, Tobias. So my name is Matthew Halliday. I'm one of the cofounders
[00:02:11] Unknown:
of Incorta. I'm responsible for product and product marketing at Incorta, which has been a journey now for about 8 years. And prior to that, I spent my time at Oracle and Microsoft. So been around the data space for quite some time working with various enterprise applications and providing value from data, analytics from data, and helping businesses try to solve complex and difficult problems. So at Incorta, we really birthed the company out of this vision that there was a fundamental underlying problem that nobody was addressing in reality. And it was before you could do anything with analytics, you had to start with dimensional models. You had to go through this kind of heavy processing.
And we just wanted to challenge that. If there was a better way, could we do analytics on the data in its original form or shape and be able to do that more effectively and efficiently so we could deliver faster time to insight? So we're all about providing customers, users, people who wanna ask questions of data to do that in a very timely, fast fashion, not months and, you know, years, to be able to shorten that time down and then be able to provide data in a way that helps them make informed decisions. That's really what we're all about. And do you remember how you first got started working in the data space? Computer science graduate. I joined Oracle. I started working on databases and working with applications, building data. Very quickly started to realize that I had a fascination with learning from data and tracking events and things. So the first thing I built was like an application that just tracked environment usage, like what was going on with environments. And there were about a 100 database environments that I was responsible for at the time and who was logging into what, how often and all those kind of things. So I started tracking all of this behavior around these environments, and this was back in the nineties, late nineties. And so kind of got into that. And ever since then, I've always been fascinated about data to the point that I was one of the early adopters of, like, heart rate monitors when it was just Polar doing the thing before it became all the smart devices that we have now. And used to use that to go, you know, training, working out, and then analyzing all my data and looking at how to be a better athlete from all the data that I was acquiring. So always had this fascination and interest with how data can help me be more effective and efficient in my job, and, you know, that kind of rolls into other areas as well.
So kind of always had that kind of role around big data applications, worked on some of the largest Oracle accounts with some of the largest data volumes they were producing there and helping, you know, customers be able to leverage that data for themselves as well. And you mentioned a bit about the story behind
[00:04:47] Unknown:
what motivated you to build Incorta. I'm wondering if you can give some detail about what it is that you're creating there.
[00:04:54] Unknown:
As I mentioned earlier, right, everyone starts with the same dataset. Right? You have an application. You have a source system, and there's data residing in a data model that was intentionally built the way it is. And we refer to this as third normal form, 3NF, or relational datasets. And that's where you might have something like an order. Right? An order for a product and you would have actually probably, you know, you can imagine the screen. Right? You can imagine, oh, I have a, you know, who's the customer, the address, the bill to location, the ship to location, what is contained within the order, if there's shipping lines, tax lines, all of this kind of information, you could imagine it on the screen in front of you. In reality, behind the scenes though, it's probably 30 to 40 tables that are being used to generate that one order that you're looking at. And that process of pulling it together, we all know if you've been around a while, remember the days when you had to call up and, you know, kind of ask for a status on an order? And, you know, invariably, the person, you know, would ask for your order ID or order number, and then they would say, oh, my system's a little slow today, you know, as they would wait 5, 10, 15 seconds to bring up your one order.
The reason it took that long is because it was stitching together and doing joins across these tables, and that becomes the most difficult, complex thing you can ask a database to do. It's actually so inefficient that the way that it happens is what we refer to as O(n log n), which is exponential in cost as you add joins. So if you're looking at one record, you kind of can get through it because it's like one joining to another to another. But then when you say, I wanna look across all my transactions, well, now you're taking that 10, 15 seconds and multiplying it by all your transactions and it becomes, you know, an unwieldy beast. So that was kind of the world that we all start with and the applications we have. And the problems with those is that they don't lend themselves to analytics. And there are reasons why we create 40 tables behind the scenes so that we can do things like you can update the bill to location while someone else is, you know, shipping an order. You don't want those things to kind of be all over the map and causing issues there for people.
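To ground that, here is a minimal, illustrative sketch of the kind of multi-table join it takes to reassemble even one order from a normalized model. The table and column names are made up for illustration, not an actual ERP schema, and a real order can touch far more tables:

```python
# Illustrative toy normalized "order" model (made-up table names) showing how
# even a single order requires several joins to reassemble.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE addresses (address_id INTEGER PRIMARY KEY, customer_id INTEGER, kind TEXT, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     bill_to_id INTEGER, ship_to_id INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER,
                          sku TEXT, qty INTEGER, amount REAL);

INSERT INTO customers VALUES (1, 'Acme Co');
INSERT INTO addresses VALUES (10, 1, 'BILL_TO', 'Austin'), (11, 1, 'SHIP_TO', 'Reno');
INSERT INTO orders VALUES (100, 1, 10, 11);
INSERT INTO order_lines VALUES (1000, 100, 'SKU-1', 2, 40.0), (1001, 100, 'SKU-2', 1, 15.0);
""")

# Reconstructing one order already touches four tables; an analytical query
# repeats this same stitching across every order in the system.
rows = cur.execute("""
SELECT o.order_id, c.name, b.city AS bill_to_city, s.city AS ship_to_city,
       l.sku, l.qty, l.amount
FROM orders o
JOIN customers c   ON c.customer_id = o.customer_id
JOIN addresses b   ON b.address_id  = o.bill_to_id
JOIN addresses s   ON s.address_id  = o.ship_to_id
JOIN order_lines l ON l.order_id    = o.order_id
WHERE o.order_id = 100
""")
for row in rows:
    print(row)
```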
So those are some of the reasons we have these. Now, the problem comes when you want to do analytics. When you look at that, you can't run queries that take 23 hours to run, which is not uncommon to see queries just running against this 3NF model, taking days and then failing. That's kind of the world you live in. So it's like, okay. How do we fix that? So conventional wisdom goes back even to the eighties and nineties. It was funny, I was in the office the other day, and I saw a magazine that was from 1995, and it was saying, this might be the year of the data warehouse. An article was written by Rayne Atkinson about this very topic of how to do analytics, and it was all around: you've gotta change the shape of the data, and you've gotta put it into a dimensional model. And when you put things in dimensional models, you're really changing the whole shape of the dataset.
Now that process becomes very expensive. Changing and moving the data, transforming it, creating surrogate keys and joins, all of those things that go into creating a dimensional star schema model really become restrictive. And you have all these different models, but they're not connected as well. So you get very isolated analytics. So, you know, I think Oracle had the database. It has Java. It has Sun Microsystems. It has analytical applications. It has the applications. It had the R&D team actually responsible for the deployment of Oracle at Oracle, and yet at the end of a quarter, what would you see? You would see, when we were there, spreadsheets being handed around. There would be, here's all the deals that closed.
And so that's what, you know, people were working with. And in reality, that process just takes too long. Normally, you can only do it, you know, once or two or three times a day in terms of refresh rates. And because it's going through that transformation, it's a model that really is built for the query that you want to ask: what's the question you want to ask? And I start with building a data model to answer that question. So when a lot of people think about, you know, performance tuning or, like, how do I make my queries faster? They think, you know, oh, I gotta make the query more efficient. I gotta write better SQL. Well, in reality, most of the time when we would do performance tuning at Oracle and Microsoft, what we would actually do is change the data model.
We would look at it. It's not always changing the SQL statement. It's like, how can I eliminate some joins? Do I need to denormalize some of this data down because I'm going to cross a join just to get one field, which is kind of very inefficient. So you do all of these things, but the problem is that becomes highly restrictive. So we thought, okay, what if we focused on that rather than just, you know, looking at creating another tech stack and another, you know, set of ETL tools? What if we could say, let's work out how to make a query engine that's specifically built for the relational dataset, the dataset that you find in your operational analytics, the ERP systems, the CRM systems, the supply chain systems?
How do we bring all of those things together? So that's really what, you know, Incorta set out to address, and we did that. We created a query engine where instead of it being exponential when it does these types of joins that are very prevalent in 3NF, it actually does them in O(n). So it makes it feel like there's really no overhead. So we have customers running with billions of records on 40 table joins, and it's not table joins from, like, a large dataset to a very small dataset. We're talking about billions of rows joining to billions of rows, joining to hundreds of millions of rows, joining to more hundreds of millions of rows, and, yeah, it kind of goes on like that. A lot of people would do this with, like, yes, we can process petabytes of data and even join it as long as the join tables are very small and can be replicated across all of the nodes. Right? So that's exactly, you know, what people have kind of been doing or thinking, but that requires you to change the structure because that's not what your source data looks like. That's not what you start with.
So that's kind of exactly what we kind of got into and focused on in building solutions for that business problem. Like, how do we make that work? And and that's where we hit on something. We figured out a way to make it work. And then the benefits are, well, now we're doing data analytics on the kind of duplicate digital twin of your source application. So it's almost like equivalent to doing, like, maybe bronze or silver equivalent where you don't have to go to a dimensional model. You can do the analytics on top. You can run these queries and get sub second responses. You can change the queries very easily. Everything is metadata driven. It's not creating data transformations behind the scenes or cubes or, you know, materialized views or caching in some kind of creative way to anticipate what you're doing. It literally is just running those queries as you ask them, and yet you're getting those responses. So that's kind of the genesis of the idea and where we went with the product.
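The complexity point is easier to see with a toy comparison. The following is a generic build-and-probe hash join sketch, purely illustrative, not a description of Incorta's actual join algorithm:

```python
# Generic illustration of why join strategy dominates cost: a nested-loop join
# scans one side for every row of the other, while a hash join builds a lookup
# table once and probes it, giving roughly linear behavior.
from collections import defaultdict
import random
import time

orders = [(i, random.randrange(100_000)) for i in range(200_000)]   # (order_id, customer_id)
customers = [(i, f"cust-{i}") for i in range(100_000)]               # (customer_id, name)

def nested_loop_join(left, right):
    # O(n * m): tolerable for a single record, unusable across all transactions.
    return [(l, r) for l in left for r in right if l[1] == r[0]]

def hash_join(left, right):
    # Build a hash index on one side, then probe once per row of the other: ~O(n + m).
    index = defaultdict(list)
    for r in right:
        index[r[0]].append(r)
    return [(l, r) for l in left for r in index[l[1]]]

start = time.perf_counter()
joined = hash_join(orders, customers)
print(len(joined), "rows via hash join in", round(time.perf_counter() - start, 3), "s")
# nested_loop_join(orders, customers) would need ~20 billion comparisons here,
# which is why it is left commented out.
```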
[00:11:35] Unknown:
And as far as the sort of technological underpinnings of being able to manage these digital twins of your source data without having to do all of this upfront work to extract it, load it, transform it, model it before you can actually write the query. I'm wondering if you can talk to just the overall system design and architecture and some of the new engineering that you had to do to be able to build out this query engine to support this workflow that is I haven't come across in any of the other conversations I've had for, you know, the specific approach to how to manage the data workflow.
[00:12:10] Unknown:
There's a few things that happened. Right? So one of the first things is you look at a relational model and the data is optimized for record-level writes. Right? So when you do that, it's great because you're reading all the information because you're rendering it all on the screen. With columnar, that's the first step. Right? You have to adopt columnar structure because that really helps with the way we do analytics. We generally look at all of your measure, like, for revenue against all of another dimensional value, and we're not really interested in any other value apart from those 2 pieces at that point. So being able to store your data in a columnar format is very appropriate and makes a lot of sense. So of course we're not the first people to do that at all and a lot of people have been doing that, but it just makes sense that that's kind of the new standard. So leveraging file formats like the Parquet file format, you know, that's how we store our dataset.
Then on top of that, there's the question of what do we do with in-memory processing? When I started out at Oracle in '97, I think 1 gig of DRAM would set you back about $20,000. When you look at how that changed when MapReduce came out, it was down to about $400 in 2004. If you look at the genesis of Incorta back in 2013, 2014, that was down to, like, $5 for 1 gig of DRAM, and now it's less than a dollar. Right? It's about 70¢, roundabout. A change like that means you architect and build things differently when the constraints have changed. And so it was all about how to optimize for disk. You know, that was one thing. But now it's like, okay, now we have memory footprints that we couldn't even imagine. So now your whole entire queries can oftentimes just reside in memory.
Right? The entire dataset that you're gonna work on is gonna be in-memory data processing. So how do you leverage that? And so really when you think about memory, there's multiple levels to that memory conversation. So we looked at how can we best leverage L1 cache versus L2 versus L3 because, you know, it's the fastest memory that you have in a CPU. How do we make sure that we don't go through the bus, which is slow to go to the memory? When we bring something into memory and we are transacting on it, how do we make sure that it's, you know, faster and it actually is more effective? Well, our L1 hit ratios are significantly higher. There's a couple of reasons for that. One is that we keep the data compressed.
So we actually do analytics on compressed data. We're not talking about compression of data just on disk for storage and then uncompressing it into memory. We keep it compressed. So we do analytics on compressed data. And then also because we keep it in its original form, there is a correlation: what you're reading for the next piece of the query is probably already there because these things haven't been rebuilt, because they're in the natural way that they were created. So when you do that, what ends up happening is you get this much faster response. So leveraging things like that, leveraging the memory footprint in a, you know, kind of unique way, those are some of the techniques we're able to take advantage of. That gets us so far. And so we looked at, like, 50 database papers around how do people do joins? How do people join table A to table B and table C to get to G?
And every single one of those white papers, all 50 of them, had exactly the same way of doing it. It's a hash table function that's O(n log n), takes a lot of time, and makes joining two datasets together the most punishing, you know, operation you can ask a database to do. Right? So that's what we went after. We built an algorithm, our own hash table join function, that does that in a much more effective and efficient way. Now there's other things that are going on in the market that you need to be aware of. Right? So what we saw is also, you know, with in-memory processing and Spark coming into play, it's like, okay, how can we leverage Spark? So there's Spark within our platform as well, which can help with, you know, taking what would have been PL/SQL functions in an Oracle database and being able to use those or even using kind of value add transformations. So when I talk about transformations, I largely put them into two buckets.
One would be a transformation that brings value to the data. It improves the data in the sense of it makes it more valuable. And then the second one is it transforms the data in order to do performance tuning to make a SQL query kind of come back with an answer. Right? That's the dimensional modeling and transform kind of piece on that side. So, you know, we kind of look at how can we do this in a much more effective way. And so we have that, you know, capability using Spark. And so we've seen really creative ways in how we can use Spark within the platform as well. And so even things like companies now doing fixed asset depreciation, which is normally a very intensive ERP, PL/SQL, EBS kind of intensive task, moving that out into PySpark and running it there, and then taking, you know, these heavy, expensive queries that would have been on an Oracle database, running those on Incorta, getting your depreciation of your assets, and then being able to push them back. So we see, you know, some of those technologies. But there are some technologies that, you know, show a lot of promise but don't work for these types of challenges, too. That's the other thing that I think people need to be aware of, that there isn't one data platform that suits everything. You know, we have a number of customers, one in kind of the media, a couple in the media streaming business, and one of the largest social media sites.
And they have digital data platforms that can process hundreds of petabytes of data and, you know, whether it's impressions, click information, all those kind of things, and they do it really effectively. And that's generally how people think about these technologies. Right? It's like, hey, we put this great technology together, whether using Presto or some kind of technology like that. But in reality, if you throw 56 gigs of an ERP 3NF data model at it and ask it to run, it won't be effective and it won't work. Sometimes people think, oh, like, how big is this dataset? That's not the question. It's what is the data structure? Like, you need to understand that if it's a single table, it could be 300 petabytes and you can, you know, slice it up, farm it out across lots of nodes, bring back your results and get really, really good performance. And there's a lot of data problems and challenges where that's perfect. That's the right solution.
But know that other solutions, or certain other applications that you deal with, do need a different approach. Now the thing that I think is interesting is how do you bring these things together? How do we bring an environment where, you know, you're not a company that says, hey, we have 300 petabytes of data that we can process, but we still have to have this, you know, 1980s, 1990s technology for running our data warehouse. And, sure, we might have modernized it by putting some aspects on the cloud, but in reality, it's still the same data transformation, data modeling that we've been doing for the last 30 years plus.
So we look at how do we bring these two together? And so within the Incorta platform, it's leveraging things like Parquet, having things like Spark, PySpark, Scala available within the platform, looking at those open standards. You can start to do predictive analysis and various other things, but bring all of the business users around the data onto a single platform. But be open to all those other things because I think there's a lot of specific solutions to specific problems that people encounter. So having things like Delta Lake format or looking at, you know, Parquet or if you're looking at, like, Iceberg tables, those things make it a lot easier for the data to be transported between different solutions to provide that. So I think there's a lot of things going on in this industry and especially in the last 5 to 6 years where this has sped up quite significantly in terms of how we engage with data and how data is actually even delivered to us as well.
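As a rough illustration of that pattern, querying source-shaped tables landed as Parquet without first reshaping them into a star schema, here is a minimal PySpark sketch. The paths and column names are hypothetical, and this is a generic Spark example rather than Incorta's own API:

```python
# Minimal PySpark sketch: join and aggregate source-shaped Parquet tables directly.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("parquet-direct-query").getOrCreate()

orders = spark.read.parquet("/lake/erp/orders")          # hypothetical landing paths
lines = spark.read.parquet("/lake/erp/order_lines")
customers = spark.read.parquet("/lake/erp/customers")

revenue_by_customer = (
    lines.join(orders, "order_id")
         .join(customers, "customer_id")
         .groupBy("customer_name")
         .agg(F.sum("line_amount").alias("revenue"))
)
revenue_by_customer.show()
```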
[00:19:41] Unknown:
As you have been going on this journey of building this system and working with your customers on it and trying to keep it up to date with all of the rapid evolution that's been happening in the surrounding ecosystem, what are some of the ways that the overall design and goals of the platform have changed or evolved in that time?
[00:20:00] Unknown:
Yeah. So I mentioned Parquet. When we started out, we weren't using Parquet. Right? So, you know, 8 years ago, we had our own format. We moved off of that a few years into our journey. It went that way. I think the other thing that we're starting to see now is the rise of the table formats. Right? So there's a file format, but now there's a table format. And you can think of this as really being, how do we create the disk or the cloud storage layers to feel like the equivalent of your DBF files in Oracle, really in an open standard way that you can drop in any application on top, any query acceleration engine on top, whether that's in quarter or someone else to be able to run against that data. So I think that definitely seen that shift and change in trying to move towards that. I think there's been also a big shift in people moving away from just doing things in traditional kind of ETL tools to actually exploring a lot now in terms of what can they do with things like PySpark, whether that's, you know, pushing loads back into, query engines to create these transformations, but value add in that and some changes there. The other bit I would say is we've definitely seen a shift in where the focus has been in terms of companies.
A few years back, it was all about, we're a Tableau shop or we're a Power BI shop, and that's our standard. I think what we're starting to see now is that people are realizing the visualization is kind of the last step. And it's almost like saying, oh, we're an Outlook company. It's like, so what if I have an iPhone and I just want to use Mac Mail? Like, I can't use it. No, no. We just use Outlook. It's like people want to consume data the way that they want to consume it in the formats and the variety of formats that they want. So that really isn't the problem. Right? We should be open to any visualization tool, I would say, you know, being able to connect and represent it how that user wants to look at it.
Conversation now has definitely moved away from that, but we're not hearing that so much anymore. Sometimes you get in and people just ask, oh, we use Power BI. Do you have that as a connector? And of course, yes, we do. But the conversation now is more around how do I get data in the hands of my business users to solve their day to day business challenge problems at the speed of which things are changing? So that's been kind of a bit of a shift where it's not about the finished product on a known set of questions. It's more about how do I handle the unknown?
And I think, of course, even prior to the pandemic, it started with the China tariff crisis, and there was a whole bunch of, like, you'd have products on a ship in the ocean. And you're like, if this lands, you know, in a harbor and I can get it off a boat just today instead of tomorrow, there's going to be different, you know, taxes and tariffs imposed. What does that mean? What does that mean for my product profitability? Do I need to change the product's pricing? Can I absorb it? All of these questions started coming. And of course, then in the pandemic and now with supply chain and just all the challenges that we're seeing, people are now realizing more than ever that, you know, data wasn't just a scorecard. Data shouldn't be the thing where I do my kind of Monday morning quarterbacking and just, you know, evaluate how people did. Really, for data to be effective within a company, it's in the moment when the decision is being made. How do I make a better decision because of the data that's informing and helping me, that's alerting me to blind spots that I might have, versus using data to assess, did I do a good job? Right? That's really demoralizing because you're like, well, I did the best that I knew with what I knew in the circumstance.
But, you know, if I had had access to the data that you had after the event, of course, I wouldn't have made this decision. I would have made a completely different decision. Or if I had the freedom to ask more of those questions, to be able to change, to handle what if scenarios, to play out different, you know, use cases or different scenario planning. All of those things become really, really important. And that's where the conversations are going. And so now, you know, we're starting to see the rise of this in a few different ways, I would say. You know, one is people start talking about micro batch ETL, things like that, or reverse ETL, like, how do I get back to the action, the step to actually take, to do an action? Because at the end of the day, just like with a dashboard, if that's all I do, that's just expensive artwork.
Right? If there's not actually a decision that's been made, you're probably better off not spending all that money on the digital data pipeline that you have and the dashboards that you have. Go buy some real artwork and motivate, you know, your employees and have them look at that. It'd probably be better. But if you can actually get to a decision and, you know, what are the buttons that they can press and pull, how do I help them make the better decision? That's the key thing because that's when data proves its value. So 1 of the things that you kind of look at as well in this space is, is this like, you know, for all the talk of being data driven, it's such a nebulous term. Right? What does data driven mean, and how do I know if we're data driven? Does that mean we use data in our executive meeting last week? Someone brought up a spreadsheet and it looked pretty impressive and nobody could challenge it. And so we went with the decision.
I think, no, that's not being data driven. I think there are markers, characteristics of companies that are truly data driven, that you start to look for. So one of our customers, Broadcom, they used to run 400 queries a day and they were sending out reports and their business users were taking those reports on a daily basis. You know, imagine they go in their email, they get a report, they look at it, and then they go and do their job. Once they got Incorta, though, certain things changed. All of a sudden, they went from running 400 queries a day to 70,000 queries per day.
Right? Okay. That's a shift and a change in the way people are working and the way people are thinking about data. That's an indicator to me. Like, one of the things they said: it used to take them 12 to 18 weeks to even just add a single dimensional value. So if you say, hey, I've got this star schema. It's great. I'm looking at this. It goes so far, but I wanna bring in this new value, this new dimension. I wanna look at this revenue by fill in the blank. Right? No one knows what it's gonna be this week. It could be different than last week. They turned that around to being, like, less than a 5 minute exercise for them. So it's like, oh, you want that? Drag, drop, here's your report done.
Again, that just keeps people in that flow of when they're making decisions, when they're being informed, that's the expectation. That's where we've got to get to. So the data feels empowering and not like it's the thing that is almost like my quarterly business review that's that's used against me to show me how, you know, terrible a job I'm doing or how great of a job I'm doing, but really isn't helping me do a better job. So those are some of the things that I see in terms of the shift and the changes. Of course, there's a lot of different technologies coming into the marketplace around that. But, you know, I think the question that people have to keep looking at and asking is, like, are those things what we're driving to? Because technology can be interesting and it can be like a puzzle.
Like, hey. I got this thing to work together. Wonderful. But in reality, it's what's the value you're providing? What are the projects, and how did they change the business?
[00:27:02] Unknown:
Absolutely. And in terms of the sort of personas and use cases that you're focused on, I'm wondering if you can describe the sort of intentionality that you're using to determine what are the core capabilities that we want to bake into Incorta, what are the pieces that we want to allow for extension and integration with other systems, and how does that inform the way that Incorta sits in the different workflows of the various personas throughout the organization?
[00:27:34] Unknown:
We've had a lot of success with people who are responsible for the applications. They're in IT, and they're kind of responsible for the reporting, the analytics on top of those applications, not necessarily always in the kind of central analytics team for some of the enterprises. And what we found there is that when you look at that persona, they understand their application, they understand the business use case really, really well. And so we've had a lot of feedback from those individuals on the challenges they're trying to address and what they're trying to do. And so we've built solutions specifically to help in those areas. So that's, you know, one of the reasons why recently we were announced by Gartner as a niche provider within the Magic Quadrant, which, you know, is very competitive to get into. There's, you know, 20 companies that make it in. So you have to knock someone out in order to get into those spots.
But really focused on those use cases, we've been able to bring, you know, a lot of value. Now, there are times where you get off, you know, those use cases, and you wanna start to look at other things. It's like, well, how do we integrate and play with those? So for us, it's looking at, you know, where's the value we provide, the uniqueness that we provide. So, for example, once you land data in Parquet, it's great because it makes it accessible to a whole bunch of other applications. So we have customers with integrations with, like, Dataiku where they come in and say, hey, I wanna go and do some data flow processing here and then push the data back to Incorta and then leverage it. We have customers that have Snowflake and use that as, like, an archive layer where everything sits, but then, for the performance acceleration, they're using Incorta to do those queries on top.
So there's ways in which we can create environments where those things work. So one of the interesting partnerships we've done recently is with Microsoft. And so Microsoft realized that, you know, Incorta really understands this ERP ecosystem very, very well and is able to take Oracle EBS or SAP customers and provide data that can be leveraged and used within Synapse. But, you know, customers were struggling to get and make meaningful sense of that data, to get it out of those systems and then into a shape and a format. And so using Incorta and our query engine, we can actually create, whether it's a dashboard or a dataset, something that can then be leveraged by another application such as, you know, Synapse, that can be used to then power and help them, you know, however they wanna use that data moving forward.
So we've definitely seen that people come in and start their journey from different places. Some of our customers, you know, in the mid market segment oftentimes come in and say, this is great. You've got connectors for NetSuite. A customer came into our product last month. Within 19 days: they came in on day 1, they signed up for an Incorta trial, they then connected to their NetSuite data. On day 2 or 3, they were looking at that NetSuite data and then building out their own custom dashboards. And then within 19 days, they were an Incorta customer, and they'd already moved forward because everything they had wanted was contained within that. Then, of course, on the other side, you have enterprise customers, you know, some of the largest brands, Fortune 100 companies in the world where they want to know how do you tie into the other systems and investments we've made.
And that's where that kind of open architecture comes in, which has, I think, become very common now across a lot of the more modern vendors that are in the marketplace, where we're realizing that, yes, it's not us or them, it's us and them better together. Right? It's like, how do you leverage these components and do what they do best? Right? So Incorta does 3NF data models very, very efficiently. I don't see anything in the market that comes close to what we can do there. But if you were to come to me and say, Matthew, I've got, you know, a bunch of log files that are coming off my web servers, I'm not gonna say, hey, you should be throwing those into Incorta. You'd be thinking, oh, I might use Splunk for that. Or if you come in and say, you know, I've got some click data that's coming in off of, you know, behavior, IoT sensors and devices, you might think, you know, Presto is a great application for that, just given the volumes and the things that you're looking at. So I think it's really being able to realize that data is different.
Data has different qualities about it, and being sensible about that can help bring all of the business cases that you're trying to solve together in a way that kind of gives you the best result. Now, the challenge I have been seeing, quite honestly, is that oftentimes in that particular space, the integrations are more complex than they should be. So a lot of vendors, you know, they have integrations, but they're very heavy on the code side. So one of the things we focused on is how do we make low code, no code integrations across the board? So how do we have it so that you don't have to come in and do a whole bunch of heavy coding to get some of these tasks done? And in many respects, I always think these are very menial tasks. They're not high value.
We want the people who are doing data engineering to do the high value tasks. The more you do that, the more you shine, the more of a hero you become. If you're spending all your time just kind of like, hey, I've got to just peel potatoes because I need a lot of potatoes. Like, that's probably not the magic of what you can produce as a chef. It's like, what can you then do, whether you're, you know, making gnocchi or whether you're, you know, doing something else with them, that's when you can really elevate it. So it's like spend more time doing that, less time just doing this grunt work. So we spent a lot of time to focus on those partnerships of, like, how to make that more effective and reduce a lot of the coding aspects that are in there that really don't scale, because it's not like you have one data pipeline. You probably have thousands of them. And then if you have to do that and maintain them, you get to a certain point where all you're doing is keeping the lights on. And that's super frustrating where you can't meet the demands of your customers and the new requirements coming in. We know so little of the business data is actually being analyzed. You know, some estimates are anywhere from, like, 10 to 20 percent of business data actually gets analyzed.
And part of it is because the cost of it has just been so much. You have to prioritize what's the data that's most important. But when you can start to reduce that threshold or that cost of query significantly and the amount of effort it takes to answer a question, all of a sudden that data becomes a lot more accessible and much more interesting, so more opportunities come as a result.
[00:33:43] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end to end data observability platform. Trusted by the teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic end to end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. One of the things that you've gone back to a few times is the idea of dimensional modeling and 3rd normal form and the source systems and how you're able to explore your data more quickly and more effectively. And I'm wondering if you can talk to some of the ways that the usage of Incorta and the capabilities that it provides influences the ways that teams think about the data modeling aspect and how they want to land that source data into their analytical systems for being able to, you know, ask and answer questions about it.
[00:35:11] Unknown:
Ultimately, the way people start to think about this is you think more in terms of the availability of the data that you have, the possibilities of what you can answer, and then how do you roll those out in more self-service ways? So a lot of what our customers have been focusing on is rather than being the person that creates the report for that individual, have that data in the platform, and then provide a semantic layer on top of that, because here's one of the things. Right? I would never wanna put a 3NF model in front of a data analyst. Here you go. Build your Tableau dashboard on top of this. It's intimidating.
It's huge. Now everyone starts with that same model, but it's like, who actually gets to interface with it? Traditionally, in a star schema, what has been nice about it is it's a simple model when you look at it. You have a fact table and you have some dimensions and it's quite simple and straightforward to use comparatively. So what we say is, bring that down, though, into a layer, into a semantic layer, which is all metadata driven. And that's the beautiful thing about it: you can create almost, like, unlimited virtual stars that can be connected with each other within a semantic layer as metadata, not as a physical implementation of that data.
That's where it becomes problematic because you have to write a whole bunch of data pipeline code to make that happen. So when our customers view this, the way that they've started to think about it is how do I create these business schemas, the semantic layer that I can put out and then have my users build whatever they want? So we have applications that, you know, customers have used, like Shutterfly, for example. They had Oracle EBS and they recently just moved to SAP. And that does exactly the same thing for SAP where they took out all of their supply chain planning, their inventory, payables, purchasing, and brought all that data together so they could have clear insight into exactly where they stood at any point in time with their inventory levels, stock levels, days on hand.
Now what they did for their business users is we put this semantic layer in place. They didn't use any of the sample dashboards. They literally just used the sample dashboards to gain understanding, education, if you like, right, to figure out, okay, this is how it all fits together, this is how it works. Those business users actually built their own dashboards because they had invested the time in creating this semantic layer. They were able to build it themselves and ended up building, like, 30 dashboards that enable them to run and see everything within their business. Ended up reducing stock outs from, like, 77 down to 7, year over year, by being able to do that. That's kind of where you see the investment. It's like how to get people to feel like they can be self served, that they have access to a lot of possibilities, that if they have new dimensions they want to get access to, it's a simple request. It's not something that's like, okay, back to the drawing board, we're gonna have to tear down this, you know, this star schema and rebuild it.
Those things are just never gonna scale. That's the problem. Like, you can get a star schema to do things, but you can never scale star schemas at the speed that the business needs to move or the amount of data the organization's generating. Like, those things will never work. And also, they can never be real time enough. Like, by definition, you know, you're replicating the data and then you're putting it through a bunch of transformations. You can throw more hardware at it. You can throw processing at it like crazy. It's much more efficient if you can just run directly against it and you don't have to do those kind of changes and work.
So that's kind of been a lot of the thought is just around how to leverage the data in a more meaningful way for the business, how to put it in front of them in a way that they can consume, that you don't have to be the gatekeepers all the time. And yet you still have things like controls. You still want governance, you still want security. You want to make sure that people are only having access to what they have. You want to be able to manage it, see which dashboards are being used, what's the query performance over time, what's the data volumes going in and out over time? Having the visibility into managing all those things so the service isn't disrupted, that people get access to what they need and can have all of that. So it becomes like a different focus, which actually feels very empowering for the people who are the consumers and gets people out of just doing a lot of very busy work or frustrating work because traditionally when I spoke to a lot of data engineers and analysts, they were pretty frustrated because everyone was unhappy with them because everyone wants more than they can produce. So they have a lot of unhappy customers and it's not because they're not sharp, it's not because they're not smart, it's just because there's no way that they could get the teams to be the size they would have to be to be able to satisfy everything that people would need. So, yeah, it becomes very challenging. So this is where a lot of our customers are saying that this is very, like, freeing where they can now start to go and do some more creative work and look into other areas of the business as well. So what we've seen is a shift of data engineers taking on a little bit of some data science roles. No worries, like, you know, hardcore data science, but starting to use some of the same things, same technologies, you know, using Python, using, you know, pandas, using, you know, maybe a gradient tree to look at some kind of, you know, different kind of ways of looking at the data, looking at, you know, how to surface different pieces of that data that might be of interest to the business as well.
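As a purely hypothetical sketch of that "virtual star as metadata" idea, not Incorta's actual metadata format, a business schema could be represented as a small definition from which queries are generated against the source-shaped tables:

```python
# Hypothetical sketch of a metadata-driven "business schema": the star exists
# only as metadata, and SQL is generated against the source-shaped tables.
from dataclasses import dataclass, field

@dataclass
class VirtualStar:
    fact: str
    joins: list[tuple[str, str]] = field(default_factory=list)   # (table, join condition)
    dimensions: dict[str, str] = field(default_factory=dict)     # label -> column expression
    measures: dict[str, str] = field(default_factory=dict)       # label -> aggregate expression

    def to_sql(self, dims: list[str], measures: list[str]) -> str:
        select = [f"{self.dimensions[d]} AS {d}" for d in dims]
        select += [f"{self.measures[m]} AS {m}" for m in measures]
        join_clause = "\n".join(f"JOIN {t} ON {cond}" for t, cond in self.joins)
        group_by = ", ".join(self.dimensions[d] for d in dims)
        return (f"SELECT {', '.join(select)}\nFROM {self.fact}\n{join_clause}\n"
                f"GROUP BY {group_by}")

# Example definition with made-up table and column names.
inventory = VirtualStar(
    fact="inv_transactions t",
    joins=[("items i", "i.item_id = t.item_id"), ("orgs o", "o.org_id = t.org_id")],
    dimensions={"item": "i.description", "org": "o.name"},
    measures={"on_hand_qty": "SUM(t.quantity)"},
)
print(inventory.to_sql(dims=["item", "org"], measures=["on_hand_qty"]))
```

Adding a new dimension here means adding one metadata entry rather than rebuilding a physical star, which is the scaling point being made above.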
[00:40:23] Unknown:
And to that point of the fact that Incorta frees up some of the time of these different roles and the work that they're trying to do because they don't have to do as much heavy lifting to be able to empower different members of the business. I'm wondering if you can speak to some of the different collaboration capabilities that Incorta provides for being able to either bridge across these different roles or allow these roles to overlap more.
[00:40:47] Unknown:
Bringing your business users, your data engineers, your data scientists onto one platform where they all use the same data, I think, has been really, really compelling and interesting because really at the end of the day, you want people to be using the same data, the same definition of data. You don't want your data scientists to have a different definition of what revenue means than the business is looking at. And so Incorta, one, is bringing things like notebook experiences into the platform as well as dashboarding capabilities as well as, you know, having this semantic layer, on which you can leverage, you know, anything, whether it's Excel, Power BI, Tableau, on top of those things, has been very effective in kind of bringing people together on that. It's also because everyone's using the same data with the same refresh rates and refresh cycles, it becomes very easy for them to understand exactly what is going on in terms of the data and the way that they interact with it. And so in the Incorta platform, we've made it easy because you can control what people can access. So for example, I can create a set of business rules so that if a business user is coming into a dashboard, maybe they can only see transactions that are specific to that region. Maybe you're a West Coast sales manager. You should only see your West Coast accounts.
However, if you're doing data science, you might need to look at everything, but you don't need to see all of the columns; there are some things you don't need to be looking at. Or you need to mask the data or encrypt the data so that you don't actually see the values behind it, but you just wanna get ultimately, you know, is it a 0 or 1? Right? You change strings and various things into just a classification. So you wanna be able to classify the data. You don't necessarily need to have access to everything. So being able to set up those kind of rules within the platform, also even hierarchies, when you think about hierarchies and the way you traverse those, being able to ensure that, you know, you can look at salary information, but maybe you can only look at an aggregate level. You can't go all the way down, but then other people might have to have access to an aggregate level of that information. So having that within one platform, where you set those sets of rules and you can have different policies, just like you would share maybe a Google Doc or something with somebody. You can do exactly the same thing within the platform. So it becomes very easy for people to collaborate around the data, to leverage the same things, so that you don't have to build it multiple times. So oftentimes, you'll see business analysts working on a data warehouse, and then, say, the data science people are working on the data lake. And those things are completely separate, and they've got, like, different pipelines to the two of them and they probably end up, you know, with code issues between them. So you get different results and then you've got to figure out whose number's correct.
Like, all of those challenges start to come into play. When you bring everything together and you're kind of consolidated on that, you know, one platform or one approach for the platform, if you want to think of it that way. When I say one platform, I mean everything's in Parquet. You have the same semantic layer. You have the same business rules being applied. You have query engines. Query engines could be different. You might be using Spark SQL. You might be using Incorta's analytical engine for the 3NF queries. You might be using Spark and using an Incorta query to help create a data frame that you're then gonna leverage and use. So you could be using a business schema. So as a data scientist, I might say, oh, I'm gonna look at this business schema that was created for these business users and do some analysis work on it. So being able to use that same dataset, be able to share the data, collaborate around the same definition of the data makes a lot of sense. And I think there's a lot of value in doing that rather than just kind of doing the same, in essence, work again, but in different tool sets for different outcomes.
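A generic sketch of that kind of shared-access rule, not Incorta's actual security model, might look like the following, where role-based row filters and column masking are applied to one shared dataset (paths and column names are hypothetical):

```python
# Generic illustration of role-based row filtering and column masking applied
# to a single shared dataset before handing it to different personas.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shared-access-rules").getOrCreate()
sales = spark.read.parquet("/lake/erp/sales")   # hypothetical shared dataset

POLICIES = {
    "west_coast_sales_manager": {"row_filter": F.col("region") == "WEST",
                                 "masked_columns": ["customer_email"]},
    "data_scientist":           {"row_filter": F.lit(True),
                                 "masked_columns": ["customer_email", "customer_name"]},
}

def apply_policy(df, role):
    policy = POLICIES[role]
    filtered = df.filter(policy["row_filter"])
    for col in policy["masked_columns"]:
        # Hash instead of dropping so the column stays usable as a key/classification.
        filtered = filtered.withColumn(col, F.sha2(F.col(col).cast("string"), 256))
    return filtered

west_view = apply_policy(sales, "west_coast_sales_manager")
science_view = apply_policy(sales, "data_scientist")
```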
[00:44:33] Unknown:
And in terms of the ways that you have seen teams and organizations using Incorta, what are some of the unexpected applications that you have seen?
[00:44:43] Unknown:
Yeah. So I've seen a lot recently, and I don't know if this might be just something that kind of came out of the pandemic and what was going on in the pandemic. I know I've hit a lot of organizations trying to look at product profitability and figure out which products should we prioritize. So, you know, some companies cut down their product lines because they wanted to put out the ones that were, one, most associated with their brand, that you would expect and want to see when you go there, and, two, generated the most revenue. So even prior to that, there's a coffee distributor that we work with, a global coffee distributor that many years ago started off with product profitability, wanted to drill down to, you know, 20,000 plus store locations and look at every piece of the SKUs that they have by day. Like, oftentimes, it's amazing the abstract level of detail that customers are generally used to having. There's another kind of big technology box company that sells, you know, a lot of the technology products that you know and love and buy. And they had a similar thing where they would know things like, we know that the computer segment for this month did well. But they couldn't tell you if that's because they were selling more Ethernet cables, more services, or MacBook Pros.
It was fascinating, and they couldn't even pinpoint it down to a single store location on a single day and say, wow, we had, you know, 100 MacBooks sold in this 1 store on this 1 day. They wouldn't know that level of insight. So we've seen those cases really ratchet up. I think that makes sense. So is it surprising? No. There's huge value there, so I see a lot of that, and I think it's very effective. One of the ways that surprised me was this coffee distributor. They have, like, loyalty cards. Right? So you go to a store, you use your loyalty card, and you have some, you know, money left on it, and you can start to use it. So there's this accounting policy called breakage, value card breakage.
And what it means is this: when you sell a gift card for, let's say, $20 and someone hasn't used it, you can't recognize that as revenue until they start to use it, because in essence you're carrying a liability. They haven't rendered the service; the customer just has a commitment, credits. It's not, in essence, something that you can burn down yet. Now, there are certain rules where, over a certain period of time, you can start to recognize it as revenue. They realized that it's unrealistic to think that if there's a card that has $0.25 on it that was used in earnest for 6 months and then disappeared for 6 years, someone's going to show up 1 day with it and say, hey, I want, you know, my discount on my coffee, I've got this card. So, you know, this coffee distributor actually holds more money on these value cards than a lot of banks do in liquidity.
So they have huge volumes of money. And we're talking about being able to recognize an additional, like, 50,000,000 in revenue, because they were able to look at not just 12 months of data. You can think about every single swipe and every single rule that applies to this; with, like, 30,000,000,000 plus transactions on these stored value cards, they were able to look at their entire dataset. It was something that they were not able to do using, you know, an Oracle database and other technologies. And it wasn't until Incorta that they were able to leverage that, which in turn increased the amount of revenue that they could realize from, you know, something as simple as better enforcing accounting policy and rules on value cards and what they could recognize as revenue. So that was quite an interesting change.
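To make the breakage rule just described a bit more concrete, here is a toy sketch in pandas. The card data, column names, and the dormancy threshold are all made up for illustration; the distributor's actual recognition rules and the full 30 billion row calculation are far more involved.

import pandas as pd

# Each row is one stored value card with its remaining balance and the date of
# its last swipe; the data and the 3 year dormancy threshold are assumptions.
cards = pd.DataFrame({
    "card_id": ["A", "B", "C"],
    "remaining_balance": [0.25, 12.00, 3.50],
    "last_used": pd.to_datetime(["2016-03-01", "2022-11-20", "2017-07-04"]),
})

as_of = pd.Timestamp("2023-01-01")
dormancy_threshold = pd.Timedelta(days=3 * 365)

# Cards that have sat untouched past the threshold: under the assumed policy,
# their remaining balance can be treated as breakage and recognized as revenue.
dormant = cards[(as_of - cards["last_used"]) > dormancy_threshold]
recognizable_breakage = dormant["remaining_balance"].sum()
print(f"Breakage recognizable as revenue: ${recognizable_breakage:.2f}")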
Probably 1 of the other ones was fixed asset depreciation, which I talked a little bit about earlier. We always thought of our platform as an analytical platform. But in reality, when you think about it, what did we say we do really well? We do really complex queries very, very efficiently and fast. And then you couple that with something like a Python and Spark platform, and all of a sudden you can say, well, hey, we've got these really complex business processes. So this was a social media company. It has, you know, millions of assets, you know, in its servers, and it needs to depreciate them. And that's a really complex process to run, and it takes a long time. So on Incorta, all of the queries that needed to be run as part of that process were much, much faster.
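As a rough illustration of the kind of bulk depreciation query being described, here is a short PySpark sketch. The asset schema, the storage path, and the straight-line method are assumptions made for the example, not the company's actual close process.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("depreciation-sketch").getOrCreate()

# Millions of asset records, one row per asset.
assets = spark.read.parquet("s3://finance/fixed_assets/")

period_end = F.lit("2023-06-30").cast("date")

depreciation = (
    assets
    # Straight-line monthly depreciation per asset.
    .withColumn("monthly_depreciation",
                (F.col("cost") - F.col("salvage_value")) / F.col("useful_life_months"))
    .withColumn("months_in_service",
                F.months_between(period_end, F.col("in_service_date")))
    # Accumulated depreciation is capped at the depreciable base.
    .withColumn("accumulated_depreciation",
                F.least(F.col("monthly_depreciation") * F.col("months_in_service"),
                        F.col("cost") - F.col("salvage_value")))
    .withColumn("net_book_value", F.col("cost") - F.col("accumulated_depreciation"))
)

depreciation.groupBy("asset_class").agg(F.sum("net_book_value")).show()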
That increased their ability to close their books on time, to be able to close out the month and the quarter end. So we've seen use cases like that, which are kind of interesting when you start to see that processing side. Another 1 was a company that did their vacation accruals; figuring out exactly all of the rules that come into play for vacation accrual can get very complex. So those are more like high intensity data processing use cases that have been taken care of on the platform, versus just thinking, how do I get a dashboard that shows, you know, my spend in these various categories and things like that? So we're definitely seeing some interesting, creative ways in which people have come in. Probably the third 1 I would say is that right in the early days of COVID, there was a top 20 US bank that uses Incorta.
And literally within 2 days, they created an entire COVID management set of dashboards that would track things like VPN access. Did the employees have access? Were the people working from home able to log in to the VPN? If people did come into an office location, how did they do, you know, tracking, and be able to alert if someone had a COVID exposure? They had all of that information, plus training: do we need to get people training? Do people even need hardware? Some people didn't have laptops, didn't have machines at home to work on. How do you get your entire organization mobilized to work remote in a very, very short amount of time and manage and look at that? So a project like that, looking at even COVID cases and intensity, bringing in external datasets as well and overlaying all of that, became really fascinating to see: just how quickly they were able to turn on a dime and start to implement their COVID policies and procedures.
[00:50:36] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. In your own experience of building the Incorta platform and growing the business, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:39] Unknown:
Oh, there have been many. I think 1 that is kind of serendipitous: when we started Incorta, we thought tablets were gonna take over the world. Like, we thought everyone was gonna be on iPads and various things. At that point, there was a lot of commentary out there that, you know, they were going to take over MacBooks or laptops and everyone would be using these tablet devices. And so we started off by building our product 100% on tablet. It was all cloud services behind it, of course, but you were doing analytics on the tablet itself; you were building, you were bringing in new datasets, you were doing everything on the tablet. We thought this is empowering, this is the way people will wanna work on the go. And then we ended up, you know, selling Incorta to a company that makes, you know, very world famous tablets. And they were like, yeah, it's cool. We like it. But you know what? We're gonna really need a desktop application.
And so then we were kinda like, well, if the people who make these tablets really want a desktop application, maybe we need to rethink whether it's gonna be just on tablets. So we ended up implementing what we had built on the tablet on the web. And so we built a web experience. We didn't build a client; we wanted it to be on the web. At the time we built it, it was right before React had really taken off, so it was Angular. We built it in Angular, and it mimicked exactly our iPad experience. And that actually became a real positive, because if we had built from day 1 for a desktop experience, you can easily fall into the trap of making things complex.
When you have a tablet being the forcing function, what ends up happening is you develop a very simplified way of doing tasks. And so we did that, and it actually worked out really well for us, because it ended up being like, wow, okay, this is very efficient, this is very effective. People liked the user experience because it didn't feel clunky and complicated and difficult, which is, of course, because it was an iPad app or a tablet app. Right? You can't make them ridiculously complex with lots of clicks and double clicks and things like that. You generally have to simplify that process. That was kind of a fortunate mistake, if you like, that ended up paying off really well. There have been other ones, where we looked at various other things before we landed on Parquet and Spark. There was a time where we looked at Drill, and we were looking at Drill queries, and we ended up moving to Spark because it was more effective for that kind of processing.
There are data formats. Like, do you wanna go with ORC, or do you wanna go with Parquet? We weighed the benefits and costs of both and went with Parquet. There are a lot of things that come along like that. It's like, okay, do you want to go with the Delta Lake format or do you want to go with the Iceberg table format? So that's 1 of the things we're looking at right now. And I'll say we have Delta Lake format support, but we're looking at Iceberg as well and seeing what the options are and where things are going. So there are definitely a lot of things that are moving, and how to integrate with those and how to leverage those. We ended up moving off of Angular onto React, which is a project that took a bit of time for us to complete, but that has been, you know, a very positive move as well just in terms of developing the application.
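As a rough sketch of the format choice being described, here is what writing the same DataFrame as plain Parquet versus as a Delta Lake table might look like in PySpark. The paths are hypothetical, and the Delta write assumes the delta-spark package and its JARs are available to the Spark session; this is only an illustration of the trade-off, not a recommendation.

from pyspark.sql import SparkSession

# Assumes the Delta Lake extensions are on the classpath; configuration shown
# here is the commonly documented setup for delta-spark.
spark = (
    SparkSession.builder.appName("format-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.read.parquet("s3://lake/raw/orders/")

# Plain Parquet: just columnar files, no table-level transaction log.
df.write.mode("overwrite").parquet("s3://lake/analytics/orders_parquet/")

# Delta Lake: the same Parquet files underneath, plus a _delta_log directory
# that adds ACID commits, time travel, and schema enforcement on top.
df.write.format("delta").mode("overwrite").save("s3://lake/analytics/orders_delta/")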
There are a lot of those things that we're constantly evaluating. You know, I spent the last 2 days at an off-site, just meeting with the team and really taking a hard look at, like, what do we have in our platform? What do we have because we had to have it, and we had to have it because nothing else was there? What's going on right now that maybe we should be partnering for? Maybe we shouldn't do this anymore. Is there value to our customers? And is that what makes us unique? Just because you have a piece of functionality doesn't mean you should continue to have it. Sometimes it might be more effective to say, let's decommission that, let's use a different vendor or partner with a different company, and then let's put all of our focus into what makes us unique. So a big thing that we keep going back to is: what is it that makes us very unique? And it always goes back to this: we can do analytics on the bronze, the raw ingest, the digital duplicate data, the 3NF data, like nobody else. That is what we're all about.
Yes, we have connectors. Yes, we can create Parquet and Delta Lake format. Yes, we have all those things. But that's the core reason why people get excited, because they're like, well, that's a game changer. That changes the paradigm. It's not just that you're a better version of something. So that's where I'm always trying to focus the team: let's not just build things because there's an opportunity, but really build things because that's what we do really well. That's where our IP is. That's what makes us unique and different. And, yeah, that message is starting to get traction. You know that when other vendors start to say things and people say, hey, they're starting to sound a little bit like what you've been saying for the last 4 or 5 years. It's quite rewarding in some respects, I guess, because you're like, okay, that means there is that mind shift, which needs to happen first. So when people make that shift, then they're like, well, okay.
Now I want to do analytics on my raw data without having to have expensive data pipeline ETL processing. Then it becomes a question of, well, who does that best? That's great. You know, I'll take that conversation every single day. Happy to have that. And so it's interesting to see that mind shift start to take place, where people are beginning to question things like the star schema model. We've been running a set of webinars called Death of the Star Schema, and these things have just been off the hook: the amount of people that show up that are interested in this topic and in figuring out what exactly is going on with dimensional modeling and how we change it. And that speaks to, really, not our product directly. It speaks to the shift, the change, the desire to do something different. It's a very hot topic right now.
[00:57:15] Unknown:
And for people who do want to accelerate their analytical capabilities and the speed at which they are able to do discovery, what are the cases where Incorta is the wrong choice?
[00:57:25] Unknown:
As I mentioned, where we are focused and where we have spent our time is around the application data. So it is those 3NF datasets. I think it's not the right choice if you were to come to me and say, hey, I've got IoT data that I want to, you know, go and analyze, or I've got log files that are coming off of my servers that I wanna look into and figure out what to do with. There are different solutions that are geared towards that. So if your data is largely flat: maybe it's a use case like in the semiconductor industry, where they have a lot of data that comes in for every batch of wafers that they produce, looking at different parameters, thresholds, and tolerances, and those tables can be, like, 8 to 10,000 columns wide. Our modeling in Incorta can support that, but you can have, like, masses of data being generated in super fat, wide tables where all the data is contained within them, and you're trying to look for anomalies.
Right? So that's where it becomes like there are different platforms. You'd look for, you know, a massive MPP solution that's just like, okay, let's get, you know, a solution to address that. That's where I'd say, you know, you wouldn't use Incorta. Incorta can integrate with those; the majority of that workload would be on something else, but, yeah, definitely the output of that could feed into something. You know, the data from those could then be persisted into Parquet, and then Incorta picks that up and joins it with other bits of data as well. So that's kind of the beauty of what I think is happening in the industry: I think Parquet, or cloud storage, I guess S3 buckets, are really gonna replace what we've seen in the past. So in the past, if you wanted data from Marketo or you wanted data from Salesforce, the way you'd do it is they create a public API. Right? So an API, whatever, a REST API. And you go in and you end up, you know, pulling data from those APIs. They manage them. They maintain them. And then you have to create these connectors and then bring the data out.
What I see happening, and where I think this is going, is if I were building an application today, let's say like a Zendesk or something like that, and my customers wanted access to that data, which is obviously super common right now, I wouldn't go create a bunch of data APIs to do that. I would say, okay, I'm gonna get a product, maybe there's data replication, that will land that data in Parquet, in an S3 bucket, in AWS or whatever they want, and they pay for that. And so you just get that replication done and the vendor provides it. And instead of me having to create and support all of those APIs, I can just plug something in, probably in like a few weeks, and then release that feature and then start charging for it. I've been buying some products recently to use with Incorta, and that has been the thing. It's like, oh, I need the data connector. And they say, well, we have an S3 bucket that, you know, you turn on, and then you can read the data from that.
So as you think about that transport layer, or the interface layer, becoming that format, it becomes interesting, because it's like, okay, I have all these different applications coming together. Who can read that data and sit on top of that data and integrate with that data? So being able to read directly from that Parquet data and then provide value on top of it, and use different solutions for different parts of it, becomes, I think, very interesting. So I think that's some of the change that is taking place in the way that people think about how they access data.
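A minimal sketch of what consuming vendor-landed data like that might look like, assuming pandas with pyarrow and s3fs installed and AWS credentials configured. The bucket, prefix, and columns are hypothetical; the point is simply that the interface is Parquet files rather than a REST API.

import pandas as pd

# The vendor replicates its application tables into this bucket on a schedule.
tickets = pd.read_parquet("s3://vendor-exports/helpdesk/tickets/", engine="pyarrow")

# From here it's just another DataFrame to join with the rest of your data.
open_by_priority = tickets[tickets["status"] == "open"].groupby("priority").size()
print(open_by_priority)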
[01:00:38] Unknown:
And as you continue to build and evolve the Incorta platform, what are some of the things that you have planned for the near to medium term, or any problem areas that you're excited to dig into?
[01:00:49] Unknown:
Yeah. I think 1 of the things that I always keep asking is: because we do analytics without changing the shape of the data, what is it that we can do so well that other people struggle with? And 1 of the things has been that we can build applications, data applications, on top of the data model, and they can apply to every customer. So we have, as I mentioned, Shutterfly or this coffee chain or the social media companies or the media streaming companies, and they all use our same data application. That is largely unheard of, because they have very different ways of looking at the data. And traditionally, before Incorta, they would have had very different data warehouses. They would have looked very different, and it would have been very difficult to create 1 that is one-size-fits-all. Here, they start with the same product, they start with the same application. So we've invested a lot in creating an ecosystem around that. We refer to these as data applications, where we provide quick start ways for people to deploy them and start leveraging them, whether it's NetSuite, EBS, Cloud ERP, SAP, etcetera, etcetera. But what we're working on now is how do we create a framework where there's more of a marketplace where people can collaborate? So a big thing for us is how to create more opportunities for collaboration. 1 of the things we did recently is we released our component SDK, which is a React based SDK where people can build any visualization and deploy it into the Incorta environment, and then they can use our analyzer, which is our builder, to build dashboards using those components.
What that meant is a guy on my team was able to take that component SDK, grab, like, an Apache ECharts chart, bring it across, import the project, connect it, and deploy it in an Incorta environment. So now you can start to think about creating very creative, purpose built data applications that almost feel like video games, in the sense that we can really build for the specific use case. When you look at dashboards in general, what I've seen happen a lot is that every dashboard kinda looks the same, but the data might be radically different. So it might be a different area of the business, but it'll have, like, oh, there's a scatter plot, there's a, you know, column chart here, there are various things.
Very few times can you look at one and go, oh, that's obviously a sales dashboard, or that's obviously a returns dashboard, or a purchasing dashboard, or a collections dashboard. But now we can start to build those kinds of things together. And when we start to get that momentum moving as well, you can then overlay this whole idea of understanding how people interact with data, because the data didn't go through a black box, because the structure is the same. If any 1 of the customers is interacting with an object, now, because that data lineage is very clear, we can say, oh, you're interested in this column?
People like you were also interested in this other column when they picked that 1. Or when they did a group by on this dimension, 95% of the time they filtered by this. So I think it's: how do we bring the metadata knowledge of how people want to interact with data into this, so we can start to, like, help people understand, versus just looking at what the data itself seems to say is interesting? That's the vantage point that a lot of other companies are taking: how do I surface interesting things in the data? I think that goes a certain distance, but I think it misses the human element. Like, what are people looking at and engaging with, and how do we tie that back? Because that becomes very interesting, because at the end of the day, we build analytics for people, generally.
There are some closed loop systems that will automatically do something, but I'd take those as a different use case. For this 1, it's: how do we gauge what's informative and useful for people and bring that to the surface? So I think there's some interesting work around that. There's nothing that is a standard in the industry around how to create these data applications, so we're looking at how do we create that? How do we create a marketplace where you can deploy a Tableau workbook on top of maybe an Incorta dataset that helps you look at collections, that maybe has been built by a vendor for a specific, you know, industry use case, and that kind of helps elevate, you know, their analytical solutions? So I think there are interesting ways in which we can leverage the community and people at large, in terms of the questions they ask and what's of interest to them, versus just the technology for the sake of being interested in technology.
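To illustrate the "people like you also used this column" idea described above, here is a small, self-contained sketch that counts column co-occurrence from a hypothetical query-usage log. The log schema and the data are invented; this shows only the general co-occurrence concept, not Incorta's actual feature.

from collections import Counter
from itertools import combinations

# Each entry: the set of columns touched by one user query.
query_log = [
    {"revenue", "region", "order_date"},
    {"revenue", "region", "product"},
    {"revenue", "order_date"},
    {"salary", "department"},
]

# Count how often each pair of columns is used together.
co_use = Counter()
for columns in query_log:
    for a, b in combinations(sorted(columns), 2):
        co_use[(a, b)] += 1

def suggest(column, top_n=3):
    """Columns most often used together with `column`."""
    scores = Counter()
    for (a, b), count in co_use.items():
        if a == column:
            scores[b] += count
        elif b == column:
            scores[a] += count
    return scores.most_common(top_n)

print(suggest("revenue"))  # e.g. [('order_date', 2), ('region', 2), ('product', 1)]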
[01:05:21] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:05:36] Unknown:
Of course, 1 of the things we've been talking about extensively today is the need to get access to all the data. Right? So I think 1 of the things that's been very limiting in the industry at large has been that most analytics are on a massively reduced, you know, subset of the data. Whether that's just a subset of the columns that are available or not, they're not getting access to everything. So that's been a fundamental challenge for people, and I think it's really not helping with what the business needs access to. I think the other thing is when we think about the data formats. Right? So we've seen some great movement in terms of Parquet. Parquet is a columnar format. And then you think about the table formats as well, which I think has been great in terms of the work that's been done there.
But I would say Parquet, even though it's relatively new, is showing its age a little bit, because you can think about Parquet as a format that is columnar, but there's also a case where sometimes, for some of the data, it's actually best to have it in a record-oriented layout that's not columnar. And so I think we need to start to see more of this hybrid within the files themselves, because at certain points you need to read all of the rows, and so you might be like, I need to read everything in this area, but then at other times you need to see it as columnar. So I think Parquet potentially could see some changes in that space. I think that would be an area we'd be eager to help move forward.
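A rough illustration of that columnar-versus-row-oriented trade-off, using pyarrow. The file path and schema are hypothetical; the point is only that a columnar layout lets you read a couple of columns cheaply, while a consumer that needs every field of every record has a fundamentally row-shaped access pattern.

import pyarrow.parquet as pq

# Columnar access: read only the two columns a query needs. The Parquet reader
# can skip every other column's bytes entirely.
narrow = pq.read_table("events.parquet", columns=["user_id", "event_time"])

# Row-oriented access pattern: the consumer needs every field of every record
# (for example, replaying events), so the whole table is read and iterated
# record by record, a workload a row-based layout serves more naturally.
full = pq.read_table("events.parquet")
for record in full.to_pylist():
    pass  # per-record handling would go here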
And then, on the other aspects, just in terms of data lineage, data understanding, the semantic layer, there are a lot of things out there. In terms of how people define those and get them in 1 place for their entire organization, regardless of whether it's data science or data analytics, I think that's a key piece that people seem to be missing. You know, sometimes they focus on just 1 piece of the puzzle. Maybe they focus on, you know, the cloud data warehouse, go from that, and say that's the standard, and then they miss out on some of these integration points, which are really gonna be important, because at the end of the day you need that abstraction between what people are building and where the things are stored. Otherwise, you end up with, like, vendor lock in; you end up with a lot of things. I think there's some great movement in this area to kind of open things up, so the best technology can win, rather than winning just because you're locked into data formats or locked into 1 way of being able to access the data.
I think some of those openings will be interesting to see. And when that happens, I think the area or the gap that you start to see is: when I've got multiple solutions that I can look at for this, and you've got, like, an Incorta platform that's really good on 3NF, you've got Presto that can do a phenomenal job on, you know, large volume data, and Splunk or a graph database for something else, then how do you orchestrate between those? The SQL orchestration between those environments, and the sets of controls between them. How can you ensure that you're sending the right workload to the right engine? You might have a query where I could run this against Presto and it'll take 2 hours, and I could run this against Incorta and it'll take, like, 5 seconds. Or it could be that I could run this against Incorta and it won't be effective to run there at all, but if I run it against Presto, it will be. Having that capability where we think about using these underlying technologies in a seamless way, so users don't have to worry about it, they just build on the data, and then the workload gets routed to the most appropriate engine, because we make a lot of decisions in these products when we build them. Right? We're building for a specific use case. We're testing on certain data. In our case, we're using high volume 3NF data. Our early customers had huge volumes of data in 3NF that a lot of people don't get access to. If I go online and search for 3NF data, it's really hard to find sample datasets that are real world at scale. You won't find them.
Whereas when they were building something like Presto, they were using a very different dataset than what we were optimizing for. So you make a lot of decisions to optimize when you build a product for the use case that's in front of you, and lots of these systems all had different use cases in front of them. But I think it's gonna be interesting to see, and you can plug quantum computing into that as well. It's like, okay, how do we create an ecosystem where you've got all of these things at your disposal, and we're putting the right workloads on the right technologies at the end of the day?
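A toy sketch of the workload-routing idea being described: send each query to the engine that a set of heuristics says will handle it best. The engine names, query profile fields, and rules here are illustrative assumptions, not a real orchestration product.

from dataclasses import dataclass

@dataclass
class QueryProfile:
    joins: int                 # how many tables the query joins
    scans_full_history: bool   # whether it scans the entire history
    estimated_rows: int

def choose_engine(q: QueryProfile) -> str:
    # Deep joins over normalized (3NF) data: route to an engine built for that.
    if q.joins >= 5 and not q.scans_full_history:
        return "incorta"
    # Very large flat scans across the full lake: route to an MPP SQL engine.
    if q.scans_full_history and q.estimated_rows > 1_000_000_000:
        return "presto"
    # General-purpose fallback.
    return "spark"

print(choose_engine(QueryProfile(joins=8, scans_full_history=False,
                                 estimated_rows=5_000_000)))  # -> incorta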
[01:10:02] Unknown:
Yeah, definitely a complex and interesting question to ponder. So thank you for sharing those thoughts, and thank you for taking the time today to join me and share the work that you're doing at Incorta. It's definitely a very interesting product and an interesting problem space, and I appreciate all of the time and energy that you and your team have put into making it easier for teams to be able to gain useful insights from all of their source data. So thank you again for all of that, and I hope you enjoy the rest of your day. Thanks, Tobias. It was great being here. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language and its community.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Matthew Halliday Begins
Matthew Halliday's Background and Journey
Motivation Behind Building Incorta
Technological Underpinnings of Incorta
Evolution and Changes in Incorta's Design
Core Capabilities and Integration with Other Systems
Unexpected Applications of Incorta
Lessons Learned in Building Incorta
When Incorta is the Wrong Choice
Future Plans and Problem Areas
Biggest Gap in Data Management Tooling
Closing Remarks