Escaping Analysis Paralysis For Your Data Platform With Data Virtualization - Episode 107

Summary

With the constant evolution of technology for data management, it can seem impossible to make an informed decision about whether to build a data warehouse, or a data lake, or just leave your data wherever it currently rests. What’s worse is that any time you have to migrate to a new architecture, all of your analytical code has to change too. Thankfully it’s possible to add an abstraction layer to eliminate the churn in your client code, allowing you to evolve your data platform without disrupting your downstream data users. In this episode AtScale co-founder and CTO Matthew Baird describes how the data virtualization and data engineering automation capabilities that are built into the platform free up your engineers to focus on your business needs without having to waste cycles on premature optimization. This was a great conversation about the power of abstractions and the value of increasing the efficiency of your data team.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at linode.com/dataengineeringpodcast or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.



Datacoral is this week’s Data Engineering Podcast sponsor. Datacoral provides an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage the underlying infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time on data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal of making SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral for more information.


What happens when your expanding log and event data threatens to topple your Elasticsearch strategy? Whether you’re running your own ELK Stack or leveraging an Elasticsearch-based service, unexpected costs and data retention limits quickly mount. Now try CHAOSSEARCH. Run your entire logging infrastructure on your AWS S3. Never move your data. Fully managed service. Half the cost of Elasticsearch. Check out this short video overview of CHAOSSEARCH today! Forget Elasticsearch! Try CHAOSSEARCH – search analytics on your AWS S3.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
  • Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Matt Baird about AtScale, a platform for data virtualization and a universal semantic layer

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the AtScale platform and how it fits in the ecosystem of data tools?
  • What was your motivation for building the platform and what were some of the early challenges that you faced in achieving your current level of success?
  • How is the AtScale platform architected and what have been some of the main areas of evolution and change since you first began building it?
    • How has the surrounding data ecosystem changed since AtScale was founded?
    • How are current industry trends influencing your product focus?
  • Can you talk through the workflow for someone implementing AtScale?
  • What are some of the main use cases that benefit from data virtualization capabilities?
    • How does it influence the relevancy of data warehouses or data lakes?
  • What are some of the types of tools or patterns that AtScale replaces in a data platform?
  • What are some of the most interesting or unexpected ways that you have seen AtScale used?
  • What have been some of the most challenging aspects of building and growing the platform?
  • When is AtScale the wrong choice?
  • What do you have planned for the future of the platform and business?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode — that's L-I-N-O-D-E — today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. And having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it's storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don't forget to thank them for supporting the show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the Data Orchestration Summit and Data Council in New York City. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Matthew Baird about AtScale, a platform for data virtualization and a universal semantic layer. So Matt, can you start by introducing yourself?
Matthew Baird
0:03:00
Hi, thanks, Tobias. I'm Matthew Baird, I'm one of the co-founders of AtScale, and I have a long history in enterprise software and data and analytics.
Tobias Macey
0:03:11
And do you remember how you first got involved in the area of data management?
Matthew Baird
0:03:14
You know, it goes back a long way. By training, I'm a math and statistics guy — grew up and went to school in Waterloo, Ontario, Canada — and I got into enterprise. I don't think anybody grows up and is like, I don't want to be an astronaut, no, I want to be a data analytics guy. But I ended up in enterprise, and data is really important there. In my first programming job I was around when the internet was first really being used to build web applications, and I saw the value of building a web application server that was more data-centric, about pulling data from databases and letting people manipulate the data to do analytics. And we delivered that as a product very early — I think in '95, actually — and it was excellent, because I got to work with all the guys that were, you know, sort of pioneers at that level, exchanging emails with guys like Marc Andreessen and the team at Netscape as they were building out that level of infrastructure that led to what I would call the original data companies, which were companies like PeopleSoft and Siebel and Oracle. And I had a career that spanned multiple visits to those companies, usually through acquisitions of startups that I either started or worked at from early on. I think deriving intellect and using data to figure out human behavior is something that really drives me personally; I'm very interested in that. It's kind of the basis of — when you look at machine learning and statistics and all that stuff, it all leads to this: how will people behave when given this data? And I worked at companies in the startup space that did things like sales incentive compensation management, which is really about using data to drive the behavior of salespeople — and they're really great as a microcosm of behavior because, as we used to say, they're kind of coin-operated; it's very pure. And then I did a bunch of marketing analytics companies that were more thought of as, you know, the big data companies, like Conductor, which got sold to WeWork, and Yield Software, which got sold to HP. I took a little left turn into consumer with Ticketfly and Inflection. And then at the end of that I said, you know what, I've been building this platform at all these companies and sort of doing a good job, but not productizing it — I should go start a company. And that's when AtScale took off. And so, as you said, you've been having to replicate the same functionality in a lot of different places. So I'm wondering if you can start by describing a bit about the types of functionality that AtScale enables, and a bit about the platform itself and how it fits into the current ecosystem of data tools? Absolutely. So I come at this from more of an application builder background. I was a VP/CTO type of guy working at startups, mostly, and a lot of those challenges are: let's collect a bunch of data, or let people input data, or whatever data comes into the system, and then I'm going to do some sort of post-processing transformation on that data, and I'm going to sell you back either intellect or the ability for you to derive intellect from that data. That's very common. You know, Conductor is a great example: we crawled the web, we collected a bunch of data, we let you enter some extra data from your enterprise.
And then we turned around and we gave you a user experience that was all about: let me look at how people are finding my marketing content, and then slice and dice it by machine type, by browser type, by geo — these are all business intelligence analytics questions. And when we built those tools, typically what we'd do back in the day, either with Hadoop or even pre-Hadoop, is you'd drop the data somewhere, you'd post-process it, put it into a relational database so that you could actually query it at a speed that was acceptable for building these user experiences on top of, and then you'd manage that whole pipeline — and it was kind of a big pain in the ass, because you never could just drop the data in a source system that could also serve it. And along comes Hadoop, and Hadoop could kind of do this, or there was some hint that maybe Hadoop could do this. So we started to look at that technology, and what we decided we would build would be, initially, a big data analytics or BI solution on top of Hadoop. It was that easy. We wanted to replicate the functionality of best-of-breed tools like Microsoft SQL Server Analysis Services, or any of the other tools that you see the enterprise using, but we would be able to eliminate, or greatly push down — some of this is in retrospect, to be honest, Tobias — eliminate the painful IT data engineering that had to happen to make that analytics possible. So we delivered that initially, and it's evolved over time, because we started in Hadoop, and the challenge of getting performance on Hadoop is really one of data engineering, right? There's somewhat query tuning and traditional DBA-type stuff for driving the performance of a database, but there's a much larger portion of it that's around: how is the data laid out on disk? What are your partitions? Are you doing sorting? I mean, all the stuff that you do in Hadoop to drive performance. And that led us to first identifying scenarios where we needed to do it, and then automating the actual data engineering that happens to enable this sort of more — I don't want to say real-time, because it's not real-time — but more interactive business intelligence type query flow. So what happened over time was we expanded from a single data warehouse and Hadoop to cloud data warehouses, because that was a trend that was happening, and then beyond cloud data warehouses to multiple data warehouses, both on-prem and in the cloud. Nobody ever moves all their data all at once; they're constantly looking at maintaining legacy and pushing on the new technologies that are available. So we created a technology where we could leverage the IP that we built around what we now call autonomous data engineering, and move that into the space of supporting multiple databases, multiple locations for the data, in a way that was transparent to the user. So from a data engineering, data management perspective, AtScale occupies what I think is a new and exciting space around leveraging virtualization technology to enable an end result, which is analytics. And I think we're the only company that's really doing what I would call autonomous data engineering for analytics. There are some traditional virtualization vendors that focus on federated queries and caching; those are absolutely necessary.
And they are strategies that need to be implemented, but they're one of maybe two or three dozen different things that we do to enable our core value proposition, which we've really simplified down: we're interested in performance, security and agility, and then in the cloud, cost savings is a very real thing — because I think the second month after you get to the cloud, your CFO usually gives you a call and tells you you're spending too much money. So — sorry, I'm talking a lot — but how it manifests itself is we built a virtual data warehouse, which means we look like we're a database, but we're not a database. And it can connect to any data anywhere and present it as one universal semantic layer — and we'll talk about what that means — and then you can query it from all your tools. So at the end of the day, you have this one data service, and you can fulfill all your data needs from it, and we worry about the intricacies of scaling that up from a concurrency and performance perspective. Now, I think "universal semantic layer" gets a lot of airtime, so just to be clear on what that means exactly: there are strings and numbers in databases, we all get that, and then in the real world there are business and real-world constructs like a customer, or a hierarchy, or a dimension of time. Those things have some analogies in the low-level type system in the database, but what we want to do is up-level that to present it in a way that makes more sense to people that consume data who aren't necessarily experts. So, for instance, I want to drill from country down to state, down to city, down to a specific zip code — that's a hierarchy. That's a thing that exists in the real world that you can express in a database, but it's not super easy to write that query. There's a lot of GROUP BYs, joins, all the other stuff that happens, and it gets even harder if you're doing it across multiple databases. So we present that semantic layer to the user so that they can do the analysis they want without having to worry about the complexities. At the end of the day, abstraction is a wonderful tool for solving problems, even for business people who don't necessarily understand the concepts around abstraction and why it's valuable: you encapsulate complexity and push it down.
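To make the hierarchy example concrete, here is a minimal sketch of the kind of SQL a semantic layer generates behind the scenes for one level of that drill path. The schema — a sales fact table joined to a geography dimension — is hypothetical, not AtScale's model; the point is the joins and GROUP BYs a user would otherwise hand-write at every level.

```sql
-- Hypothetical schema: drilling from country down to state means
-- regenerating the join and the GROUP BY at every level of the hierarchy.
SELECT g.country,
       g.state,
       SUM(s.amount) AS total_sales,
       COUNT(*)      AS order_count
FROM   sales s
JOIN   geo_dim g ON g.geo_id = s.geo_id
GROUP  BY g.country, g.state
ORDER  BY total_sales DESC;
```

Drilling down to city or zip code adds another column to the SELECT and GROUP BY, and if the fact and dimension tables live in different databases the join becomes a federated one — exactly the complexity the semantic layer hides.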
Tobias Macey
0:11:51
And so it sounds like there are a few different concerns that you're covering, and your overall capabilities are partially covered by some of the different open source platforms out there — such as some of the SQL query engines along the lines of Presto, or metadata management at some level for being able to identify what data you have, sort of the data catalog aspect of things. And all of those are useful tools in and of themselves, but as you said, there's a lot of extra engineering time that needs to be dedicated to just making sure that they're running, that they're able to perform at the scale that you need — particularly as your volumes of data grow, or as the data sets change — and making sure that the on-disk aspects of the data are optimized for those different query engines. So I'm curious if you can talk a bit more about some of the types of tools that somebody might be using in an existing data stack that they've built themselves, that they might be able to replace by moving to AtScale.
Matthew Baird
0:12:45
Absolutely. You're right, there's a ton of open source out there, and I've been involved in open source for over two decades as a member of the Apache group, as a PMC for db.apache.org, and contributed tons of code and tons of my personal time to open source — and I love it. We've contributed open source at AtScale as well. We made a decision that we weren't going to open source the full platform that we have, and that's more of a business decision; that's a different business model. But the technologies that I think people are using today — let's think about that. Clearly, we are not in the landing-data space; you set up those data pipelines independently. We come into play after the data lives in an enterprise data warehouse, multiple enterprise data warehouses, or a data lake as the newer construct for that. And then you have open source technologies that deal with getting that data to the end user, who's using Tableau, using Power BI, maybe using Excel, perhaps consuming it via, you know, JDBC or ODBC and building a platform around it. Sometimes, in fact, there's none of that and it's just very raw: let's put Impala here, or let's put Presto there. As you said, those technologies are wonderful, and we leverage those and we support those — we make them better. It's like a chocolate-and-peanut-butter thing, I think people would say. You have the ability to accelerate and scale Hadoop in a way that doesn't involve having to hire and retain very expensive and hard-to-find Hadoop folks. And it's easy to set up Presto, I get that; it's very easy to set up Cloudera Impala and have that work on Hadoop. It's not as easy to get it to work in a way that scales. For instance, one of our customers is a very large credit card company — to scale to multiple petabytes, with hundreds of thousands of people accessing it, requires some really smart folks, and they have really smart folks over at that company; they just have a lot of things to do. So if you can automate some of that, that's the part where I think we've done a really unique job of delivering a solution that hits the market where there isn't any open source to do this — around automating the data engineering tasks necessary to deliver a production view of data in a way that scales. But very specifically, that's the thing about this: Apache has something called Kylin, right? Yeah, Apache Kylin, Apache Druid. Those are two technologies that I think are — Apache Kylin is probably more on the OLAP side, Druid more of an aggregation engine. Those are interesting technologies. They're super rough around the edges, to be clear — I mean, that's not a technical assertion, that's more my opinion. They do not wrap an end-to-end solution, so you're constantly piecing together a bunch of things into the platform, and then you still have that issue of: you've got to do the data engineering yourself. And I don't think there's anything in the open source that does that today.
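As a rough illustration of the manual data engineering he is describing — partitioning, sorting, and file layout on Hadoop — here is a hedged, hypothetical Hive-style table definition. The names are invented, and AtScale's automated layouts are not claimed to look like this; it just shows the kind of physical-layout decision that otherwise has to be made by hand, table by table.

```sql
-- Hypothetical Hive DDL: on Hadoop, query performance comes largely from
-- physical layout choices like these rather than query tuning alone.
CREATE TABLE page_views (
  user_id  BIGINT,
  url      STRING,
  duration INT
)
PARTITIONED BY (view_date DATE)  -- lets the engine prune whole days of data
CLUSTERED BY (user_id)           -- bucketing for efficient joins on user_id
SORTED BY (user_id ASC)
INTO 64 BUCKETS
STORED AS ORC;                   -- columnar format suited to scan-heavy analytics
```

Re-deciding these choices as query patterns shift is the repetitive work an autonomous data engineering layer aims to take over.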
Tobias Macey
0:15:57
And can you talk a bit about the actual technical implementation of AtScale, and how it's architected, and some of the ways that it has evolved since you first began working on it — particularly given the shift in where people are spending a lot of their time and focus, in how they're treating their data and storing it, and the types of platforms that they want to be able to integrate with? Absolutely.
Matthew Baird
0:16:20
So we started the company six years ago, as of September. And when we sat down, we talked about what the stack would be that we would use. Clearly on the front end it's, you know, JavaScript, HTML, CSS — you don't even need a decision there, that's just the way it works. Whether you use Backbone or React — wait a couple weeks and the JavaScript stack will change. So we just picked the best thing that we could, and we architected the front end really well to adapt to that. In the middle tier, we decided that what we would build would be an API server, and we did not want to hire a whole bunch of engineers to work on the middle tier. We wanted a high-leverage language that we could use, and I had been working with Go for, you know, probably six months beforehand, and I was very encouraged by how you could use Go: you could go away, you could come back, and you could look at the code, and literally there was no friction in just stepping into that code and maintaining it. That's been, I'd say, an incredibly successful decision. And I know that there's never any language that universally everybody loves, but in terms of having a very small number of people maintaining a large code base, and being able to hire onto that code base and maintain it almost in a part-time way, it's been super successful. The performance has been fantastic. And honestly, in the enterprise, some people said, oh, if it's not on the JVM, nobody's ever going to let you run it. Not true — it's been fine, it's been adopted there. That's for the middle tier and the API services. And then on the back end, I worked with one of my co-founders, Sarah Gerweck, who is a huge fan of and a leader in the functional programming space. And while I think, if we had lifted all the restrictions, she might have gone with something much more exotic, the choice was: it's got to be on the JVM, it's got to be something that works well with what the big data tools are familiar with. At that point in time, the big thing was Spark was starting to take off, Hadoop was definitely a thing, and we really wanted it to be functional so that we could deal more naturally with data in a scalable way. So we chose Scala, which runs on the JVM, of course, and it's been a good experience. The language is highly leveraged. You don't have to have a huge team of experts — you've got to have some smart folks that understand the concepts. And I think, honestly, it's been good for attracting the sorts of engineers that we like, that want to do interesting, tough problems. I know that there's Play for Scala and everything else, but I think if you're using Scala to build web applications, you're missing the point. It's much more powerful as a back end language for doing heavy algorithmic work, which is what we're doing. So that's what we're doing from a language perspective. From an architecture perspective, the front end now is all in React; we got rid of a lot of the Backbone stuff — you're always, as I said, progressing that JavaScript stack. What we built on the front end is a Google Docs-like, multi-user concurrent application for designing data models and data services. For that we used WebSockets, and we used a WebSocket subprotocol called the Web Application Messaging Protocol (WAMP).
Doing this pub/sub type thing worked out really nicely. I'm happy with how the architecture scaled; it's low bandwidth. Fundamentally, what we're doing is multiple people concurrently working on a thing, but it isn't a check-in/check-out or merge model. You don't have merges and forks and everything right in the face of the users — they're not used to that. So while we do support things like versioning the artifacts that we're building, in Git, you aren't ever presented with a Git merge that you have to go through. It uses MVCC — multi-version concurrency control — to order the messages and then bring them together in that architecture. So that's the front end. The middle tier is pretty straightforward: RESTful interfaces, JWT for security, CORS so that the front end can talk to both the engine, which is running in Scala, and the middle tier, which is running in Go. And then on the back end, fundamentally, we built an optimizer. It's like a database optimizer, but it understands the nature of data that's distributed across multiple infrastructures, and every piece of technology in that chain, whether it's the network or anything else — in fact, it transcends the technology, and it has concepts around cost. Because when you are trying to optimize a query across multiple databases, it's no longer just that your optimization objective is pure speed; you want to create a scenario where you can optimize for other outcomes and then blend them together, so that whoever is running this data infrastructure can get what they need out of it. Whether that's pure speed — you can have that at the expense of cost, right? I can have elastic compute resources, I can scale things up. But you have to have that knowledge of: is this data co-located? Is it going over a low-latency network? Is it going to cost me an arm and a leg to serve this out of a BigQuery or one of the databases billed on a per-resource basis?
0:21:43
So, a little bit more about it: the Design Center builds the virtual schemas, and the engine runs them. It's very much a software-project-derived metaphor — you build, you deploy, you test, and you run it — for these data services that we're dynamically generating. We used to be on Hadoop, as I said — we used to run on a Hadoop node. Now we have no tie to Hadoop at all: we run on any image, any VM, any cloud, and we can talk to every cloud database out there. Is there anything else on the architecture side that might be interesting to your listeners? No, I think that's a good overview. And I'm also interested in your mentioning that AtScale is built to be able to interface with multiple different data warehouses, and you said that it also integrates with Hadoop. I'm curious what you have seen, particularly in recent years, as far as what the breakdown is of people leaning towards data warehouses or data lakes, particularly with the advent of cloud data warehouses that are starting to blur those lines. That's a really good insight. I think that we're starting to see — you know, I don't think the data lake definition is nearly as crisp and concise as the EDW definition is, and I do think that the cloud data warehouses have blurred that even more. Typically we thought of a data lake as, I guess traditionally, this is where the data lands, and it's unfiltered and it's raw and it's ready for doing whatever you're going to do to consume on it — but typically that involved doing some ETL, pushing it into a different database or a different technology.
0:23:26
And then the EDW is where the filtered, curated, ready-to-go data lives — this is blessed, let's go have at it. The nature of data has changed in the enterprise. I'd say the forward-thinking companies are more interested in getting that data in a perhaps less filtered form — they'd love to have it more filtered and more concise, but they think that the speed at which they can get the data to the users, and the agility with which those users can consume the data, is more important than necessarily providing this beautifully manicured data experience. And you see that in the spaces of data science, machine learning and things like that. But those cloud data warehouses have the capability, and have a cost model that's not prohibitive, for storing tons and tons of data, so they can serve both purposes. And that means that they can serve all the different constituents and all the different consumption patterns — as I said, whether it's business analytics, operational databases, KPIs, ad hoc, machine learning, or whatever other things are coming along that path. I think I said a bit — is there anything in there that's interesting that you'd like me to drill in on? I think that's good. And for somebody who's using AtScale, particularly if they already have an existing set of infrastructure, or even for people who are coming from greenfield, I'm wondering if you can talk through some of the workflow of getting AtScale set up and integrated with their systems, and then, for somebody who's using AtScale for doing analytics, what that looks like. So we have a combination — we serve the Global 2000, you know, people with lots of data and probably lots of databases. Some of them are newer companies — Wayfair, I'd say, is one of the newer companies that's come up in the last five years, and they have a very forward-thinking attitude towards data. I think if you go and read what the folks there are saying about data publicly, they're doing a lot of things really smart, really pragmatic. And while it is new school and cutting edge, it does represent where I think the market's going. And that is: let's give them one place to consume data, let's give them tools to find the data, and let's give them an infrastructure that makes it so that when they start to use that data, whether at a small scale or a large scale, we can grow with the consumption of that data and make it work. You know, they have thousands of people consuming data, and their attitude is everybody should have access to all data, except in the case where it needs to be governed or there's a compliance issue, in which case they apply that governance policy at that single data service entry point. So for them — to answer your question — my Chief Product Officer always says, when somebody asks me what time it is, I tell them how to build a watch. I feel like for engineering, you know, the implicit question is always: how does this work, and where are all the edge cases, and contextualize it. So for those sorts of forward-thinking organizations, where data is considered to be like air and should be available to everybody, it's an install — a single RPM install — and you model some data: you register your data warehouses, you model, you build a virtual data warehouse, you deploy it, and you just start querying. And that's it. It's very, very easy.
Now, you have to have access to the data warehouse, which at some companies — if you're a Bank of America or a JPMC, God bless you for having controls on that, because I'd be worried if you didn't — means you have to get access to those data warehouses, and in some cases that has to be secure, and Kerberos is challenging. It's probably always going to be challenging; it's the toughest system to work with, but it's what a lot of enterprises use, so you've got to go through that part of the scenario. And then getting better at using AtScale is about being able to iterate and put something in people's hands. If you're an IT group, or you're a data producer type person, you build something, you give it to people to consume, you gather the feedback, you iterate. And we're really good at that, because once you're in the role of serving people that consume data, you find out they're always changing their mind on what they want, and when they want it is now. So being able to model something, hit a button and deploy it, and have the service instantly flip over and represent those changes is hugely powerful. Because in the past, those consumers of data have been like: okay, I've got to fill out a JIRA ticket, I wait anywhere from a week to a month to hear back on the status of the ticket, then a software project happens, and then I get the new data element, because they had to build a data pipeline that does ETL, blah, blah, blah, all the way down the line. They don't want that, and I would argue the data engineering folks don't want to be in that space either. There's a lot of interesting data engineering tasks out there, and not a lot of them use the three-letter word ETL — there's a lot more interesting things to do. So we handle that for them from an analytics perspective. The workflow for implementing AtScale is very simple, and the iteration process and the ongoing maintenance are very easy.
Tobias Macey
0:28:29
And what are some of the main challenges that people are facing in their sort of traditional, quote-unquote, data engineering workflows — particularly if they've got a strong focus on ETL — that they can just hand off to AtScale once they implement it? And what are some of the ways that you have seen people use the time that gets freed up by virtue of not having to do those types of tasks anymore?
Matthew Baird
0:28:55
So think of this like — you remember when Salesforce first came out? I forget when, but their motto was "no software." There was software. And when we say there's no ETL, there's still ETL; it's a matter of what part is automated versus what part falls outside the purview of the analytics use case. You still have to be able to write data pipelines in whatever tool you end up using, whether it's open source or commercial, for landing the data. You still have to do the work of potentially doing some data wrangling or some data profiling, if you need to improve or augment that data in a significant way. But once you've got it to the point at which you think this is a good, valuable data set, and it potentially lives with a bunch of other data sets in a warehouse, that's where we can help: the users can self-serve on it. And we have customers actually doing this — I didn't expect that there would be a customer that would roll out our design tools to users so that they could design and build their own data warehouses and have them be productionized, all without involving IT at all. But if you think of it, it's almost like what MuleSoft does for APIs, but we do it for data. So we make it easy to do, we give them some automation capabilities, and then we roll it out. I think —
0:30:27
I went off, and I think I've lost track of what the core question was, Tobias. I'm really sorry.
Tobias Macey
0:30:31
I was just trying to get an idea of some of the ways that people are taking advantage of the extra time that gets freed up by virtue of not having to do as much ETL once they start using AtScale.
Matthew Baird
0:30:44
Ah, alright. Well, that depends on the organization, I think. I think that you would be best served to have your data engineering team working on building out what some people call the real-time enterprise, or streaming, and figuring out how to improve the latency and the quality of the data as it pertains to collection and all the ingest stuff. I think that has a much higher yield, because garbage in, garbage out — and faster data, you know, faster is better. The streaming use cases — nobody's really figured that out yet at an enterprise scale, so that's a great place to spend your data engineering time. You know, I read this statistic somewhere — I wish I could remember where so I could attribute it, but I went back and checked and it's true: there are 7,500 job openings for data engineers in San Francisco, and there are 7,400 people with the title of data engineer on LinkedIn for the US. So we have more demand for data engineering skills in San Francisco than we have supply in the US. I'm sure they're going to find something to use those data engineers for if they don't have to go and get this data element and make it available in Tableau for, you know, the marketing group — that should be automated, I think we can all agree. Things that should be automated, or can be automated, probably should be automated. That's an ethos that we really believe in. And it's not about taking away data engineering jobs; in fact, it's about making data engineers much more happy in the work that they do. Look at what they're doing if you join the data engineering teams at Facebook or Amazon or Google or any of the big cloud vendors — those are really interesting challenges. Making a data element available for an end user is super high value from a business perspective, but not a fun engineering challenge. And in terms of challenges that you have been faced with in the process of building and growing the AtScale platform, I'm curious what have been some of the most interesting or unexpected ones you've had to overcome, and some of the most useful lessons that you've learned in the process. Hiring. I think hiring — finding the right people to work on the types of problems that we work on, which are extremely algorithmic in nature. Every single thing needs to be scalable to multiple petabytes on the engine side of things, which is the Scala-based software that we develop. It's all about getting the right person in there and then getting them up to speed on, essentially, how databases are built. Even though we're not a database, we're kind of doing database-like development, but even harder, because we have to support all the databases out there.
0:33:50
Then there's the heuristics. There are unsolvable problems that we have to approximate with heuristics and identification of scenarios. So let's take an example of a kind of problem that we solve that turns out to be really hard: pre-aggregation. Doing a pre-processing step on multiple petabytes of data takes a really long time, and if you want to compute every possible combination that somebody could query, and every join path, it'll blow out your data massively, right? Think of it as the old school of how OLAP worked: pre-calculate all the common metrics and roll them out. My co-founder, who was at Yahoo, had a two-week build time; some of our customers had multi-day ones. And in today's enterprise, with today's data consumption patterns, at days you're already dead. You can't be looking at two days ago when you're running flash sales, or when you're spending a million dollars a day on some marketing program — you need to have up-to-date data. So that wasn't an option. But also, not doing anything means that as soon as people start to query the data, you have no information and you have no acceleration, so the queries take minutes, hours potentially. And there's a fine line where you're doing something that's smart, that's predictive, but you're doing it before you get all those traditional signals that you get from having the wealth of queries coming in. Queries are the natural expression of interest from the end user community, right? They'll tell you what they want if you go and gather requirements and talk to them face to face, but what they really want is represented by how they query the data. So we have to make this decision — and this was a major challenge, building things that I think have never been built before — in figuring out: when do you have enough data? And how do you deal with a small amount of data but still give a good user experience to people querying? That's been probably the second biggest challenge after hiring. I mean, of course, you need those people that are smart enough to do that type of engineering. Those two are the things that I think about from a technical perspective. From a business perspective, I think traditional virtualization has failed in the past. It's become a word that's associated with a specific implementation, a traditional implementation, and we've moved past that. And you're a specialist in data engineering — you know that, like, what, 15 years ago there wasn't really a — when did data engineering become a term that we use, and become a career? How long has that been?
Tobias Macey
0:36:32
About a decade. I think it's within the last five or six years that it's really become a discipline in and of itself. I mean, it's a set of responsibilities that have been around since we started dealing with computers, but in terms of an actual distinct role, it basically postdates the introduction of data science as the sort of new and interesting career path — because companies were spending all this money on data scientists and realizing that the majority of their time was actually spent doing all of this, you know, collection and preparation and cleaning work that wasn't actually providing the end value that they were expecting. And so that's when they started breaking it out into its own discipline.
Matthew Baird
0:37:11
I see you've done this for real. Yes — data engineering enables anything that has to do with data in your company, and those folks — like, it didn't exist. When you put a name on something and you create a career out of it, it has a way of sort of forcing the codification, the definition, of what all those things are. You know, like you said, we were doing some of those things; in some cases it was tribal knowledge, it was best practices, but it was split across multiple different types of people in the organization. Once we created a career path out of data engineering, things started to crisp up, and that's what enabled us to look at it and say: these are the things that are going to require humans, these are the things that probably a machine's going to end up doing — that's inevitable — and then go through that process of figuring out how to identify those scenarios and build them out. I'd probably put that in third place for the challenges of building the company. And that represents a business challenge as well. As I said, traditional virtualization predates data engineering, and even some companies that came along that maybe didn't start in the same way that we did, and didn't look at the problem the same way, went back to virtualization as being a federated querying and caching issue, and not an automation-of-data-engineering-tasks problem. So educating the market, and getting them to understand that — I know you've heard this term, but it doesn't mean what you think it means, and it doesn't carry the baggage that you think it carries — has been a big challenge. Not a big challenge, I mean, people see it, we show them, they get it; but from the perspective of language sometimes getting overloaded, that's challenging. So we've experimented with creating new terminologies for this, but at the end of the day, sometimes people actually like the comfort of hearing language that they've heard before. What are some of the most interesting or unexpected ways that you've seen people using AtScale, and some of the misconceptions that they might have going into it, that they are pleasantly surprised by when they discover some of the true capabilities of the platform that you've built? So as I said, we focus on performance, and performance is a term that means a lot of things: can a single query come back in a reasonable timeframe? Can you scale queries to whatever the size of the consumption pattern is? So handling concurrency, handling scale, handling per-query performance is what performance is about — and then security, agility and cost savings. And I think the places where — you know, security is sort of table stakes — the areas where people are surprised are how we drive performance in a way that's completely hands-off, and how easy it is to build and deploy these data services. I think I literally drive the sales people in my company crazy, because I'll go completely off-road and we can build a model and deploy a model in a meeting, and not do, like, a pre-canned demo. And then the cost savings — the cost savings are something that we actually probably didn't think about as much, because you typically don't lead with cost; it's a race to the bottom in traditional enterprise. But in cloud, it's been a big deal, and we see some very large reductions in cost.
And the surprising thing there, in the cloud, is that I don't think the cloud vendors mind that we say we're going to save you money on BigQuery, we're going to save you money on Amazon Redshift, we're going to save you money on Snowflake. They're looking at the longer game: if I can improve the unit economics of analysis and make this platform more cost effective from an ROI perspective, you don't constrain what you're doing and say, great, I came in under budget — you maximize the use of that technology, and you create more use cases, more opportunities for people to come and use the data, because you know that you have a scalable cost model around it. So folks like Google and folks like Amazon have been very open about it. And, you know, take a company like The Home Depot, which is a wonderful customer of ours: we did very specific things in our implementation for BigQuery that solve this. I mean, you can write a query that costs a lot of money, and you can express that query a different way and it costs significantly less — say a join, which increases the complexity of the query and the slot utilization in BigQuery, versus unnesting it or doing another strategy. We can do all that stuff in an automated fashion with query rewriting, and we get them to the point where initially they thought they'd have maybe a couple hundred people on the all-you-can-eat 2,000-slot program, but now they're able to get that to, like, five or six thousand people consuming, and getting the same exact answers at the same price — and they get that in an automated fashion. From a use case perspective, there are three that come to mind that were surprising to me, and maybe this is more because I'm a technologist and not necessarily a business person. The first is the customer that gave away the Design Center: they thought the Design Center was so usable and such a great experience that they gave it to their analysts and said, here's the data, here's the tool — design your own use cases, roll them out, query them, you are entirely in charge of a production-level self-service initiative. I'm very proud of the work that we did to make a high-quality user experience product, but I didn't think that it would be one where they'd give it to people that weren't necessarily data modelers. The Home Depot — we talked about them already, but they have a massive spreadsheet that's super important for the business and goes all the way to the top, to the CEO. And I know everybody disparages Excel — I don't, I think it's fantastic; it's a tool that's probably responsible for creating more programmers than any language in the world, because everybody's a little bit of a programmer when they get into Excel. And that tool is so important that all the stores have it. They were creating a use case that served suppliers and internal folks — a fantastic use case — and the scope of exposure of that data to people across the extended Home Depot family was amazing. And I really like the Visa use case. It's not surprising; it's actually more satisfying to me, because the reason that I started the company was to build an addressable, programmable platform that wasn't necessarily just going to be people on Tableau consuming it — it was about people building businesses and building applications on top of it. And that's what they did.
And they built it — I think there are multiple hundreds of thousands of their customers accessing trillions of rows of data on a giant — probably the best and biggest use of Hadoop in the world that I know of. It's a completely valid use of Hadoop. I know it's popular to hit Hadoop, but it's wonderful for them: they have to keep the data behind the firewall, and they have so much of it that, really, nothing else will work. So that was really interesting. My co-founder, as I said, is more of a pure BI guy, so, you know, having people connect Excel and Tableau and Power BI and query all day is fascinating. But I wanted to enable the developers that have had to just gyrate wildly to get software built that accesses large amounts of data. Now they can do it, I'd say, relatively straightforwardly, with the combination of virtualization and the automated data engineering that we do.
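Tying together the pre-aggregation and BigQuery cost threads from earlier in the conversation, here is a hedged sketch of the general technique — aggregate tables plus automatic query rewriting. The table names are hypothetical and this is not AtScale's actual implementation, just the shape of the idea in plain SQL.

```sql
-- Built once (and maintained incrementally) based on observed query patterns,
-- instead of pre-computing every possible combination up front:
CREATE TABLE agg_sales_by_day_region AS
SELECT order_date,
       region,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM   raw_sales                  -- the multi-billion-row fact table
GROUP  BY order_date, region;

-- A BI tool issues this against the virtual warehouse:
--   SELECT region, SUM(amount)
--   FROM raw_sales
--   WHERE order_date >= DATE '2019-01-01'
--   GROUP BY region;
-- The engine can answer it from the small aggregate instead,
-- scanning far less data (and, on BigQuery, using fewer slots):
SELECT region,
       SUM(total_amount) AS total_amount
FROM   agg_sales_by_day_region
WHERE  order_date >= DATE '2019-01-01'
GROUP  BY region;
```

The hard part he describes is deciding which aggregates to build, and when, from an incomplete query history — the rewrite itself is the easier half.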
Tobias Macey
0:44:37
And by virtue of the fact that they're accessing the underlying data through this abstraction layer that handles some of the optimization aspects, it also helps in terms of future-proofing and reducing some of the risk of experimenting with new tools and platforms, because you don't have to re-implement any of the client-side code. You can just add in a new data store, test it out to see if it does what you want and if it has the sort of cost and performance characteristics that you're looking for, and then, if it doesn't work, you can just take it out again without having to do a whole bunch of retooling and reengineering of the rest of your stack.
Matthew Baird
0:45:15
You're a guy who programs — you know what it's like. The first thing you do, if you don't know what the implementation is going to be on the back end, is create an abstraction: in Java, it's an interface; Golang has, you know, more of a duck-typing-based contract system. You build that abstraction, you test it out to make sure it works, and then you're given the freedom to change the implementation without having all the client code have to change. And in this case, the client code is, quite frankly, humans — and humans are the hardest code to change. They get stuck in their ways; they figure out a way to do something and they want to do it that way forever. So giving that freedom to the IT and data engineering teams — I think you nailed it, that is life changing. You want to move a single table, you want to move all the tables, you want to go from on-prem to off-prem: we've effectively created an abstraction that gives you a software switch for controlling where the data is and how it's accessed, which gives freedom.
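A minimal sketch of that "software switch" idea, using nothing more exotic than a SQL view as the abstraction layer. All names here are hypothetical, and a single-database view only approximates what a virtualization layer does across engines, but the pattern is the same: clients bind to a stable name, and the implementation behind it can move.

```sql
-- Clients (dashboards, notebooks, apps) only ever query `analytics.orders`.
CREATE VIEW analytics.orders AS
SELECT order_id, customer_id, amount, ordered_at
FROM   onprem_dw.orders;           -- today: the on-prem warehouse

-- After migrating the table to a cloud warehouse, flip the switch;
-- every downstream query keeps working unchanged.
CREATE OR REPLACE VIEW analytics.orders AS
SELECT order_id, customer_id, amount, ordered_at
FROM   cloud_dw.orders;
```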
Tobias Macey
0:46:16
And when is the AtScale platform the wrong choice?
Matthew Baird
0:46:19
Well, we're not an OLTP platform — we don't handle what I would call every bit of the SQL spec. We are a multi-dimensional analytics platform, so for analytics use cases, it's good. If you want to talk about creating a data service that does sort of both sides, right now we're not that; the product is more focused. In terms of whether you should be a customer or not: you may have a large amount of data in a single store, you may have big data in aggregate — and what we find to be much more common is that every company's got big data, it's just spread across 50 to 100 to 1,000 databases. But if you have a single data warehouse and you have small data, you know, probably just throw Tableau at it, or write a web app on top of it — that's pretty straightforward. You don't need to buy AtScale at that point.
Tobias Macey
0:47:15
And what do you have planned for future iterations of the AtScale platform and business, either in terms of feature improvements or new product areas?
Matthew Baird
0:47:26
We're going to keep it simple and focus on performance, security, agility, and cost, and within that space build a platform that solves for what I refer to as a governed self-service environment. That means users need to be able to discover the data, so that's cataloguing and metadata management. They need to be able to apply policy in a centralized place so they can decentralize access. That concept is really powerful when you think about it, because it presupposes that you have a natural path to the data that goes through one place where policy can be enforced. We are focused now on the Global 2000, and we've implemented a lot of things there. You asked a little while ago about open source and replacing open source. One thing about open source, and I love it, is that the maintainers don't know all the situations around security that happen in an enterprise. So we've built out a lot of features and functionality to have the best security story of any company out there. We own big data for the Global 2000. Now we've got to go into the mid-market and focus on building products and services that are going to help those companies, because they have big data problems and data-everywhere problems just like the big guys do.
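One way to picture the "centralized policy, decentralized access" idea is a single enforcement point that every query passes through. The following Go sketch is a loose illustration under that assumption; the conversation doesn't describe AtScale's actual enforcement mechanism, and all names here are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// Policy maps a user to the tables they may read. Keeping this map in
// one place is what makes it safe to hand out access broadly.
type Policy map[string]map[string]bool

// Gateway is the single choke point all queries flow through.
type Gateway struct{ policy Policy }

// Execute runs a query only if the central policy allows it.
func (g Gateway) Execute(user, table string) error {
	if !g.policy[user][table] {
		return errors.New("policy denied: " + user + " -> " + table)
	}
	fmt.Printf("running query for %s on %s\n", user, table)
	return nil
}

func main() {
	gw := Gateway{policy: Policy{
		"analyst": {"sales": true},
	}}
	fmt.Println(gw.Execute("analyst", "sales"))   // allowed
	fmt.Println(gw.Execute("analyst", "payroll")) // denied centrally
}
```

Because every consumer goes through the one gateway, tightening or loosening a rule happens in a single place rather than in every downstream tool.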
Tobias Macey
0:48:48
Are there any other aspects of the AtScale platform and the work that you're doing there, or the ideas around data virtualization or data engineering automation, that we didn't discuss yet that you'd like to cover before we close out the show?
Matthew Baird
0:49:00
You know, I think the industry's changing. There's the whole divergence-and-convergence model of solving problems, and as I apply that to industries, we are still diverging. Data engineering is growing; it's still being defined. We haven't reached the apex and started to bring it back together into a concise "these are the technologies, these are the activities, these are the kinds of people involved in it." So it's exciting, but it's also a big challenge. We're going to have to keep up to date with what the best practices are and translate those into what our product does. Should we be driving tools like Beam, or any of those data movement tools? Probably, absolutely.

Is streaming going to become a bigger issue? I 100% believe streaming is going to be a challenge for enabling the kind of consumption of data that enterprises want over the next decade. And frankly, the whole toolchain is not ready for it. Even if you look at the traditional BI tools, they don't have any way to really work with streaming data. There are point solutions here and there, and some people have started startups to do it, but for the majority of consumption, that pipeline from ingest through to the business analyst doesn't exist. So we're going to have to see those areas change, and we're going to have to keep up to date with that. So that's streaming.

I'd be a bad CTO if I didn't say the ML phrase, but I do think this is actually a space where machine learning and advanced statistics are going to be a major improvement. Think about what virtualization gives us. It's not just virtualization across multiple data warehouses, by the way; you virtualize the column itself, so it's not necessarily a nominal value, it's a computed value. You have databases that now support pushing machine learning down to the data, and there are ways to do it, but we have to expose that to end users in a way where they don't have to be mathematicians yet still get the benefit of that sort of experience with data, where it's more helpful. And I hope it's not going to turn out like, and I'm going to date myself now, little Clippy. Remember Clippy? Oh, I remember Clippy; I think everybody who has ever experienced it remembers it. When you think about it, Microsoft was way ahead of the game on building a digital assistant. The problem was that everybody hated him. He was a nice guy; he tried. But I think those sorts of things are becoming more pervasive in a way that has to reach the people doing analytics at companies, because otherwise the amount of analysis they can do will be limited. It's creeping in with things like NLP, but there are other places where it would be super easy to just do a linear regression and show you, when you're looking at a bucket of data, that there's an outlier in there. I don't think anybody's really doing stuff even that easy right now, but we'll get there.

I learned this myself. I had a Pixel, or not a Pixel, I had an Android ever since cell phones came out, and I transferred over to Apple.
And while the Apple device is beautiful and I love it, I never realized how much the machine learning from Google in that device was helping me with my day to day. I'm a technologist, and I was obviously aware of it because I was using it every day, but it didn't strike me as a game changer because it had been folded into my workflows and I'd become more and more used to it: the predictive stuff, not just in text and speech recognition but in the applications, plugging it into my car and having it anticipate where I'm going to go, those sorts of things. It's a very natural flow that has to happen, and we have to be there as a company, as AtScale, to enable that flow of information and create intelligence from the data.
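Matthew's linear-regression example above is simple enough to sketch end to end. This hypothetical Go snippet fits an ordinary least-squares line to a series and flags any point whose residual exceeds two standard deviations, the kind of lightweight automatic check he suggests a tool could surface; the data and threshold are made up for illustration:

```go
package main

import (
	"fmt"
	"math"
)

// fitLine returns the slope and intercept of the least-squares line through (x, y).
func fitLine(x, y []float64) (slope, intercept float64) {
	n := float64(len(x))
	var sumX, sumY, sumXY, sumXX float64
	for i := range x {
		sumX += x[i]
		sumY += y[i]
		sumXY += x[i] * y[i]
		sumXX += x[i] * x[i]
	}
	slope = (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	intercept = (sumY - slope*sumX) / n
	return
}

func main() {
	// Hypothetical daily metric: roughly linear growth with one anomalous day.
	x := []float64{1, 2, 3, 4, 5, 6, 7}
	y := []float64{10, 12, 14, 16, 40, 20, 22}

	slope, intercept := fitLine(x, y)

	// Compute residuals from the trend line and their standard deviation.
	residuals := make([]float64, len(x))
	var sumSq float64
	for i := range x {
		residuals[i] = y[i] - (slope*x[i] + intercept)
		sumSq += residuals[i] * residuals[i]
	}
	sd := math.Sqrt(sumSq / float64(len(x)))

	// Flag anything more than two standard deviations off the trend.
	for i, r := range residuals {
		if math.Abs(r) > 2*sd {
			fmt.Printf("outlier at x=%.0f: y=%.0f (residual %.1f)\n", x[i], y[i], r)
		}
	}
}
```

On this sample the day-five spike is the only point flagged, which is exactly the "show you there's an outlier in there" experience he describes, with no mathematician required on the user's end.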
Tobias Macey
0:53:17
All right. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Matthew Baird
0:53:32
I'm super biased, Tobias. But
0:53:37
I do believe that the virtualization model can work, and that it will be the way people build these single data services and the way people get broad adoption of a secure, governed, self-service analytics solution for all use cases. That's the gap. I think we're in the lead; I think we have a new approach. We're not there yet, but we're the furthest along, and I think we have the best vision for doing it. And it has to be automated, because the other side of the gap is hiring: data engineer is going to continue to be the hottest, and probably highest paid, software career for a long time. I just don't see an end to that.
Tobias Macey
0:54:32
All right. Well, thank you very much for taking the time today to join me and discuss your work on the AtScale platform. It's definitely an interesting piece of technology, and one that addresses a necessary evil in the data management space. So thank you for all of your work on that front, and I hope you enjoy the rest of your day.
Matthew Baird
0:54:48
Thank you, Tobias. I enjoyed this.
Tobias Macey
0:54:55
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.