SnowflakeDB: The Data Warehouse Built For The Cloud - Episode 110

Summary

Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media and the Python Software Foundation. Upcoming events include the Software Architecture Conference in NYC and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what SnowflakeDB is for anyone who isn’t familiar with it?
    • How does it compare to the other available platforms for data warehousing?
    • How does it differ from traditional data warehouses?
      • How does the performance and flexibility affect the data modeling requirements?
  • Snowflake is one of the data stores that is enabling the shift from an ETL to an ELT workflow. What are the features that allow for that approach and what are some of the challenges that it introduces?
  • Can you describe how the platform is architected and some of the ways that it has evolved as it has grown in popularity?
    • What are some of the current limitations that you are struggling with?
  • For someone getting started with Snowflake what is involved with loading data into the platform?
    • What is the workflow for allocating and scaling compute capacity and running analyses?
  • One of the interesting features enabled by your architecture is data sharing. What are some of the most interesting or unexpected uses of that capability that you have seen?
  • What are some other features or use cases for Snowflake that are not as well known or publicized which you think users should know about?
  • When is SnowflakeDB the wrong choice?
  • What are some of the plans for the future of SnowflakeDB?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse. So Kent, can you start by introducing yourself?
Kent Graziano
0:01:39
Sure. Thanks for having me, Tobias. I am the Chief Technical Evangelist for Snowflake. I've been with the company for just a little over four years now; I joined when there were only about 100 people in the company. Career-wise, I have been in the data space for nearly 30 years. I've been doing data warehousing for over 25 of those, and co-authored books with Bill Inmon and Dan Linstedt. And I've done everything from being an independent consultant to being the director of a data warehousing team for a public school system, as well as a government contractor, and I've worked in the healthcare space as well. So I've seen the ups and downs of the industry over the last couple of decades, and I'm still excited to be in it.
Tobias Macey
0:02:33
And do you remember how you first got involved in the area of data management?
Kent Graziano
0:02:37
Yeah, actually, I do. I was on a government contract in Denver, working at the Denver Federal Center with the Department of the Interior. I got into a job initially as a programmer, and had the manager come to me and say, hey, I just came back from this class at IBM on relational database design, does that sound like anything that might be of interest to you? And then he proceeded to explain third normal form to me, and I was basically hooked at that point. Within a couple of months the project led to me getting my hands on Oracle version 5.1a, installing it on a PC, and building out the very first relational database system for that particular government agency, which was simply combining three mailing lists that were being managed in a COBOL system into a single database. That was, as they say, the beginning of the whole thing.
Tobias Macey
0:03:44
And so, as you mentioned, you're now the Chief Technical Evangelist for Snowflake. I'm wondering if you can just start by explaining a bit about what the SnowflakeDB platform is for anybody who isn't familiar with it?
Kent Graziano
0:03:55
Sure. So Snowflake is a cloud-based data platform. It started off as what I used to call a data warehouse as a service, so it was built in the cloud, for the cloud, to allow people to load and analyze data at massive speeds, taking advantage of the native architecture that the cloud brings us.
Tobias Macey
0:04:27
And I know that there are a lot of architectural changes that come about as a result of the fact that it's designed specifically for cloud-native environments. So I'm wondering if you can compare it a bit with some of the other available platforms for data warehousing, and maybe highlight some of the differences in terms of the operational and usage characteristics from traditional data warehouses?
Kent Graziano
0:04:52
Sure. In the data warehousing world, over the last couple of decades we have evolved from basic relational database systems to MPP, massively parallel processing, systems with either a shared-nothing architecture or a shared-disk architecture. What the founders of Snowflake did, because of the nature of the cloud versus an on-premises sort of system, is invent a completely new architecture called the multi-cluster shared data architecture, which separates the compute power from the storage, so that we can independently and elastically scale the storage and the compute. In the previous architectures, basically, if you wanted more storage, you had to get a bigger box, and it had more compute for the most part; if you wanted more compute, many times you were forced into buying, again, a bigger box that had more storage that you might not need. You also found that you were generally bound by the compute power of the box that you bought, say it's 16 CPUs. Well, 16 CPUs may have been great when you started your data warehouse and you had 20 users wanting to process data. If you had the great opportunity to build a wildly successful platform and you suddenly find yourself with hundreds or thousands of users, they're now bumping into each other, and they're having to compete for that shared resource. With the architecture that Snowflake has developed, you can actually create independent compute clusters for these different workloads, which allows us to really scale to a near-infinite level for the number of users and number of processes that can be running, and at the same time make sure that they are not contending with each other for limited resources. It really allows people to be much more agile in the development of their ecosystem, as they can start very small and grow very large without having to go out and purchase another system and do a migration. It's all dynamic and elastic, on demand, in the cloud.
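To make the idea of independent compute clusters per workload concrete, here is a minimal sketch; the warehouse names and sizes are illustrative examples, not anything from the episode.

```sql
-- Each workload gets its own compute cluster; all of them read the same shared storage.
CREATE WAREHOUSE elt_wh WAREHOUSE_SIZE = 'MEDIUM';   -- batch loading and transformation
CREATE WAREHOUSE bi_wh  WAREHOUSE_SIZE = 'SMALL';    -- dashboards and ad hoc queries
CREATE WAREHOUSE ds_wh  WAREHOUSE_SIZE = 'LARGE';    -- heavier data science workloads
-- Because the clusters are independent, a long-running load on elt_wh
-- never competes with the analysts querying through bi_wh.
```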
Tobias Macey
0:07:16
And then there are a number of other competitors that are targeting the cloud environment, most notably things like BigQuery or Redshift, and recently Redshift Spectrum. I'm curious if you can give a bit of a comparison of the characteristics of Snowflake as it relates to some of the other cloud-native implementations that people might be familiar with, and then also, on the more data lake side of things, systems such as Presto, or some of the SQL-on-Hadoop or SQL-on-unstructured-data solutions?
Kent Graziano
0:07:48
Sure, yeah. Even some of the systems that are calling themselves cloud native are all based on the previous architectures that I talked about, whether it was a shared-nothing or shared-disk architecture. And while they may be in the cloud, for the most part they are slight evolutions of the original architectures, and so they don't offer this dynamic scalability under the control of the customer, on demand. You have things like query engines that can scale quite large, but in general the customer doesn't have a lot of control over that, which means they can throw a really large job at it and it will do a great job, but they don't necessarily have control over how much resource is being thrown at it, which means that in the end they don't have a lot of control over how much that may cost them. Some of the other ones are really ports of the older architectures into the cloud, so you're looking at things that are effectively VMs, and the architecture is still the same, where the compute and the storage are tightly coupled: if you need more storage you've got to buy more compute, and if you need more compute you've got to buy more storage. Now, to be fair, many of our competitors are obviously evolving and adding things to their offerings over time, and we're starting to see that, but in the end it still ends up being a fairly complicated engineering feat for the folks who are managing the system. And that's one of the things that really differentiates us: the, I'll say, lack of management that you need to do. What a traditional database administrator may have had to do with performance tuning and backup and recovery and things like that, we've taken care of all of that for you, where in many of the other cloud-based data warehouse systems the DBA still has that more traditional role and is having to manage things like distribution keys and vacuuming, as well as the sizing and forecasting of the utilization. That really does make a big difference. On the other side of the house, when I start talking about data lakes and semi-structured data, one of the things that we did in Snowflake is we invented a new data type called VARIANT that allows you to load semi-structured data into a column in a table, and then access it natively with some minor extensions to SQL. So this saves people from having to parse it out and basically ETL the data into traditional tables and columns. In comparison to things like Presto and Hive, and some of the other tools out there that evolved out of the Hadoop ecosystem, those really are SQL interfaces, if you will, on top of a file system, and so you do not have some of the traditional database features such as transaction consistency and ACID compliance. Those aren't necessarily there, depending on which rev you're using; again, those are all evolving. But what we've seen over the last 10 years is what started out as very, I'll say, minimalist SQL, because we discovered everybody really understands SQL, and they don't necessarily understand Java and MapReduce, so a SQL interface had to be built to make those Hadoop-style systems work. And today with the cloud, some of it is just blob storage, whether it's S3 on Amazon or Azure Blob Storage; we're still dealing with more or less a file system with a SQL interface built on top of it.
So this does require a lot more care and feeding, and puts limitations on you as to what you can do from a traditional database perspective, which we've got all built into the Snowflake platform already.
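As a rough illustration of the VARIANT data type Kent describes, here is a minimal sketch; the table, columns, and JSON paths are hypothetical examples rather than anything from the episode.

```sql
-- Land whole JSON documents in a single VARIANT column, no up-front parsing.
CREATE TABLE raw_events (
    load_ts  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    src_file STRING,
    payload  VARIANT
);

-- Query nested attributes natively with path notation and casts.
SELECT
    payload:user.id::STRING           AS user_id,
    payload:event.type::STRING        AS event_type,
    payload:event.occurred::TIMESTAMP AS occurred_at
FROM raw_events;
```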
Tobias Macey
0:11:52
And one of the things that you mentioned in there is this VARIANT data type for being able to load semi-structured data directly into Snowflake. And I know that that's one of the attributes that has led to the current shift from ETL to ELT, where you're doing the transformations within the data warehouse itself rather than up front, because traditionally there was a lot of planning and rigor that had to go into loading the data into the data warehouse to make it accessible for other people. So I'm wondering how the performance characteristics and flexibility of Snowflake, and the availability of this VARIANT data type and some of the schematic flexibility, play into the overall data modeling requirements for using Snowflake and some of the shift towards this ELT workflow.
Kent Graziano
0:12:40
It's been very interesting for me, because I come from a traditional relational background with third normal form modeling as well as dimensional modeling and Data Vault modeling in the data warehouse world. So looking at Snowflake and the features that we have, even just looking at the semi-structured data, whether it's JSON, Avro, Parquet, things like that, a lot of our customers have found they can load that into a column. So effectively you have a table now where there are maybe a couple of metadata columns and a variant column, and then in each single row in that table, in that variant, you'll have an entire JSON document that might have arrays and nested arrays and all sorts of really interesting structures within, we'll call it, the schema of the JSON document. And our customers are able to build views on that using our JSON SQL syntax and make it look like a dimensional model. I've found this incredibly fascinating and also incredibly powerful, in that we're now able to do things like load web logs into a database table but make them accessible to, I'll say, your average business analyst in a dimensional format. So they can actually do analytics on this type of data now, using Tableau, using Power BI, using tools that they're very familiar with, and they don't even know that it's JSON data under the covers. And so with the combination of being able to load that type of data and the SQL extensions to access it, we've actually eliminated the T now in the ELT, extract, load, transform: the transform part is now a SQL view.
0:14:37
And it's transforming the structure into something that a business analyst understands. We're able to then load data much faster and really reduce the latency and the time to value for that data, and the performance optimizations that are built into Snowflake, combined with our ability to do these independent compute clusters, which we call virtual warehouses, allow us to get the performance out of that. So there's literally no indexing, there's no manual partitioning, there are no distribution keys, and so DBAs are not having to spend, I'll say, nights and weekends trying to make this stuff perform. We're able to just build views on it, expose it to Tableau, and the business analysts now have rapid access to all of this data. Additionally, we can allow them to join that semi-structured data to structured data that might have been loaded, say, from a CRM system in a more traditional schema. From a data modeling perspective, there are two factors here with Snowflake. One, in the scenarios we're talking about right now with the semi-structured data, we're doing almost virtual modeling, right? The modeling approaches, whether it's 3NF, Data Vault, or dimensional, can all be done with views. So it's more about what is the right structure to represent the data in for the business use case, and that also means that you can have one set of data and have multiple representations of it serving different use cases. This really gets us down to that concept of the single source of truth that we've been trying to reach in the data warehousing world for several decades. The other aspect of modeling with Snowflake is that the system was designed to be what we call schema agnostic. Many of the traditional on-premises database systems really preferred a specific modeling technique, whether third normal form or dimensional for the most part, and required specific tuning to make those things work. Snowflake was designed to be performant against any modeling technique, because we wanted to ensure that wherever our customers were coming from in their legacy on-prem world, they wouldn't necessarily have to change their approach to modeling in order to get effective use out of Snowflake. And so that was actually designed in from the ground up. One of the things that I didn't say earlier is that our founders wrote Snowflake completely from the ground up; it's a brand new relational database. So not only did they invent a new architecture to take advantage of the elasticity and the dynamic nature of the cloud, and to eliminate the concurrency and performance issues that people have had with the traditional systems, they were able to do this only by starting from ground zero and not forking some previous code base. It is all new code, resulting in this really dynamic data platform that can be used, as we're talking about, for data lake types of applications as well as traditional analytics applications, and be ANSI SQL compliant as well, at the same time.
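Here is a hedged sketch of the kind of view Kent is describing, presenting nested JSON in a flat, dimensional-looking shape; the view name, paths, and array structure are made-up examples.

```sql
-- A view that makes a nested JSON document look like a flat, analyst-friendly table.
CREATE OR REPLACE VIEW v_page_views AS
SELECT
    r.payload:session.id::STRING    AS session_id,
    r.payload:session.user::STRING  AS user_name,
    p.value:url::STRING             AS page_url,
    p.value:viewed_at::TIMESTAMP    AS viewed_at
FROM raw_events r,
     LATERAL FLATTEN(input => r.payload:session.pages) p;  -- unnest the pages array
```

A BI tool pointed at v_page_views sees ordinary columns and never needs to know the rows are backed by JSON documents.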
Tobias Macey
0:18:12
And there seem to be two driving factors for when somebody might decide between using a data lake versus a data warehouse, one of which being the flexibility in terms of the types of data that you can load in and the options for how to analyze it, which is in large part addressed by the flexibility in the schematic modeling that Snowflake offers, and the other one being cost. So I'm curious what that looks like in terms of a breakdown of the overall spend somebody would be dealing with, particularly on the storage side but also to some extent the compute, of storing everything in an S3 data lake or possibly even HDFS, versus what you'd be storing for the same volume of data in Snowflake, and then the amount of time and compute energy required to be able to derive value from it, or the amount of data copying necessary, particularly in the data lake scenario, to transform it into different forms for running various analyses on it.
Kent Graziano
0:19:15
Yeah, you know, that's a great question. One of the things with Snowflake is that we are MPP style, right, massively parallel processing, using MPP compute nodes in the cloud. But on the storage side, we have actually implemented an advanced columnar compression algorithm as well. So from a pure storage cost perspective, our customers pay basically the same price for storage as you would on the underlying cloud vendor. In the case of AWS we're talking about S3 storage, because under the covers that's what we're using; the difference is we are giving them three to five X compression on that data. So from a pure data lake storage perspective, you can store that data in Snowflake and in all likelihood be paying way less because of the compression that we have. Our terabyte-per-terabyte storage cost is effectively the same as blob storage, except that in our case we're able to compress it, so you're going to get from three to five terabytes compressed into one terabyte of storage. That's a huge cost savings for our customers, and that's why it's become feasible for them to consider putting their data lake in Snowflake as well as their data warehouse and doing all the analytics there. Now, your secondary question is, okay, if we put the data lake and all that data in Snowflake, how much is it costing us, and what does it look like to then transform that into analytic-style schemas? As I mentioned before, in many cases, if we're talking simply about the semi-structured data and even unstructured data, a lot of our customers are just using views, so the cost there ends up being in the compute to execute the query, where in the past you may have had costs for doing actual ETL processing. We do certainly have customers that are doing transformations within Snowflake as well, where it's necessary to apply some advanced business rules, some data cleansing rules, things like that, so you do have that as well. But that cost, in the case of transformations, whether you're talking ETL or ELT, is something they were paying one way or another before already. The advantage again with Snowflake is that our customers are finding our system runs way faster than their prior systems. I know one customer that had a report that ran for 20 hours on their data warehouse appliance, and by simply moving that data and the schema over to Snowflake, it ran in 45 minutes. So they experienced a massive cost reduction looking at the cost of compute on Snowflake in comparison to what that massive data warehouse appliance cost them. And it is a different cost model, for sure. It is a pay-as-you-go model; in the case of Snowflake, we charge per second, per node. So when you create these virtual clusters, we're talking about how many nodes there are: is it a single node, is it two nodes, four nodes, eight nodes? But we charge on a per-second, per-node basis, and generally get linear scalability. So if somebody wants to run something faster, they use a larger virtual warehouse and it'll run in half the time, but because of the pricing model it costs exactly the same amount of money; it just ran faster. And so then we start talking about the value of the time, and what else they can do when they have access to that data so much faster.
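To make the per-second, per-node pricing point concrete, here is a minimal sketch; the warehouse name is hypothetical and the node counts in the comments assume the linear scaling Kent describes.

```sql
-- Double the nodes, roughly halve the wall-clock time, spend the same amount:
--   2 nodes x 3,600 s = 7,200 node-seconds
--   4 nodes x 1,800 s = 7,200 node-seconds
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'MEDIUM';  -- scale up before the big job
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'SMALL';   -- scale back down afterwards
```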
Tobias Macey
0:23:13
And another thing that you mentioned a couple of times is the fact that, because of the ability to create these views on the underlying data structures, you can in many cases just have the data living in one place and you don't have to worry about copying it. So I'm curious how that plays into the requirement for a master data management approach, for ensuring that there are canonical schemas or canonical IDs for users or sales products or things like that, and whether that is still a necessary aspect of data modeling and data management in this Snowflake world.
Kent Graziano
0:23:49
It is, for certain use cases, and it really is about what the business requirements are. If you're talking about a lot of raw JSON web logs and the need to do data science on them and try to derive some insights from that data, then they're pretty much good to go with what we've been describing. If you are looking at building an enterprise data warehouse in a highly regulated industry that needs to be compliant with GDPR and CCPA and some of the other regulations that are out there, then we do have our, I'll say, enterprise customers, the really big organizations, still building canonical schemas and doing more of the traditional transformations and master data management type approaches. I think the beauty of Snowflake is that it allows you to do either; you're not locked in, and you have the flexibility based on the requirements of your business and your approach and what you're trying to achieve. So we do have people doing both, for sure, who are porting over things that are what I call the highly integrated enterprise data warehouse, where it does require an integration layer using something like a Data Vault modeling methodology, where you will have stage tables and you will have a core granular data warehouse that is aligning the business keys and making sure that when the reporting comes out we have a common, agreed-upon representation for a product or a customer. So all of that still does happen, and again, it just varies from customer to customer as to what their business requirements are.
Tobias Macey
0:25:41
And so you mentioned that the high-level view of how Snowflake is implemented is that there's the underlying data storage in the cloud object store, and then you run the compute on top of that in a sort of virtualized way, and in a way that gives you some segmentation in terms of who can access what. But I'm wondering if you can dig a bit more into how the Snowflake platform is architected, and some of the ways that it has evolved over the years.
Kent Graziano
0:26:10
Sure. So again, the storage layer, as we were saying, under the covers, whether it's on AWS or Google, is using the blob storage native to that cloud, and we are writing our own proprietary file structures into that storage. As a Snowflake customer and user, you don't see it as S3 storage; you don't see any storage, really. I mean, you look at a chart and see how much storage you're using, how many terabytes of data you're storing, but you can't access or see these files. It's all something we've built, with a patented file format and structure and metadata around it. That's the storage under the covers. The compute layer is then using the compute nodes, so in the case of Amazon we're talking about EC2 nodes under the covers, complete with their SSD, and that becomes our dynamic memory, if you will, along with the number of threads on the compute cluster that someone has allocated. Holding that all together is what we call the Global Services layer, and that's really the brains of Snowflake. That is where the metadata engine is that's tracking where all the files are, tracking the users and the roles, all the role-based access control. The security is all tracked in there as to who has access to what compute resources, who has privileges to create new compute resources, who has privileges to create new tables and load data, all of the standard things that you would need in a database from a role-based access control perspective. The Global Services layer also manages the transaction consistency, the ACID compliance, and manages all the security from an encryption perspective. Our underlying data is always encrypted; it's 256-bit encryption, in motion and at rest. So as you are uploading data into Snowflake, that data is encrypted on the client side before it comes over the wire and into the Snowflake ecosystem, and then it is encrypted at rest. We use a fairly robust hierarchical encryption scheme to make sure this stuff is all secured, complete with rotating keys, because one of the biggest concerns in the cloud was always, you know, is our data secure? I worked in many traditional database environments over my career where the data was never encrypted and everyone was counting on the firewall to keep the data secure, when in fact people were able to get through those firewalls, read the underlying files, and actually glean data out of them. Snowflake was built to make sure that could never happen, and because of that we are HIPAA compliant, we're PCI compliant, and we have SOC 2 Type 2 and SOC 2 Type 1 certifications, along with a number of other security certifications, to ensure that our customers' data is always secure and there's no way anyone can get to it. So it's really the three layers: the Global Services layer, which is the brains; your compute layer, which is dynamic, where you can create as many of these independent compute clusters as you want; and then at the bottom there is the common storage layer, allowing us to have one set of data and, again, not have to copy it between silos and data marts, and have that one set of data accessed by all these different compute nodes.
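As a small illustration of the role-based access control side of the Global Services layer, here is a hedged sketch; the role, warehouse, database, and user names are all made up.

```sql
-- Grant an analyst role just enough to query a curated schema on one warehouse.
CREATE ROLE analyst;
GRANT USAGE  ON WAREHOUSE bi_wh                      TO ROLE analyst;
GRANT USAGE  ON DATABASE  sales_db                   TO ROLE analyst;
GRANT USAGE  ON SCHEMA    sales_db.marts             TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.marts  TO ROLE analyst;
GRANT ROLE analyst TO USER jane_doe;
```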
Tobias Macey
0:30:05
And my understanding from previous conversations that I've had, one of which was actually on this podcast, is that the actual metadata layer for Snowflake is implemented on top of FoundationDB, so that you have one location for being able to identify where all of these files are located in the broader S3 storage, rather than having to query the S3 API and deal with those latencies when seeking through information as you're trying to run a query. I'm wondering if you can talk to either that or any of the other performance optimizations that are built into the system to ensure that it's operating at the sort of maximum performance you're looking for, and to be able to provide that reduced time to value for your customers.
Kent Graziano
0:30:51
Yeah, that's absolutely correct; as you were told before, under the covers our metadata engine is based on FoundationDB. We retain, I'll say, the addresses of all of the data in there, so that when someone is querying the data, it goes in there and looks for the data depending on the nature of the query and the names of the tables. All of that's there, and it has pointers directly to our underlying file system to find that data as quickly as possible. In fact, along with all the pointers, part of it is keeping the range of the data, to know that from A to Z of this particular column you can find it in this set of files under the covers. So we have that sort of direct pointer there, and when people are writing queries, based on the predicates and the join conditions, we have a lot of metadata information that allows us to go directly to the files where that data is really stored. Again, to a customer it's a table, right? But under the covers those tables are made up of a lot of very small files, which we call micro partitions, and one of our primary methods of performance is called partition pruning. It's based on the ability of the system to look at the query, go to the metadata, and then go directly to the set of files that contain the data that the query is asking for, so we're not having to do what would have been called, I'll say in the old database world, a full table scan. Obviously, if somebody wants to do an aggregation on a trillion-row table, well, it's going to do a full table scan. But if you're doing something more like a dimensional analysis, and we want to see all the sales for product X, for region Y, for sales rep Kent, we're going to be able to hone in on that data set very quickly and only return that set, even if we're talking about multi-terabyte tables, and even into the petabyte range.
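For illustration, here is a query shaped like the one Kent describes; the table and columns are hypothetical, and the comments summarize the pruning behavior he outlines rather than anything specific to this example.

```sql
-- Snowflake keeps min/max metadata per micro-partition, so selective predicates
-- like these let the engine skip most of the underlying files instead of scanning them.
SELECT SUM(amount) AS total_sales
FROM sales                                   -- imagine a multi-terabyte table
WHERE product_id = 'X'
  AND region     = 'Y'
  AND rep_name   = 'Kent';
```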
Tobias Macey
0:33:10
And then in terms of the overall system architecture and the implementation details, I'm wondering if there are any sort of edge cases or limitations that you're dealing with or any of the specific challenges for being able to design and implement this across multiple different cloud vendors?
Kent Graziano
0:33:28
Oh yeah, that's interesting; that gets into the fact that we are cloud agnostic. In the experience that I've had with Snowflake so far, there doesn't seem to be anything that we can do on one that we can't do on another. And that was, again, the goal of the founders: they developed Snowflake with this Global Services wrapper around it, and we're really writing to the API of the underlying cloud vendors, so that we can ensure that whether you choose an AWS deployment or an Azure deployment or a Google deployment, your Snowflake experience is exactly the same on all three. And one of the things that is coming, that we just released earlier this year, is the ability to do not only cross-region replication within a cloud provider, but also cross-cloud replication. So you could actually be running Snowflake on AWS and replicating it to Snowflake on Azure. In some cases our customers want to do that for disaster recovery and business continuity; in other cases it may be for latency reasons; and in some cases it may be for business reasons, where there are certain data sets that, for whatever reason, they only want in the Azure space, and other data sets they may need to have in the AWS space for various business reasons. So we've done quite well, because the system was designed from the get-go not to be single cloud; it was designed to be cloud agnostic. And so we have not had these issues where, hey, this feature works great in AWS but it doesn't work in Azure.
Tobias Macey
0:35:24
And then for people who are interested in getting started with using snowflake and loading their data into it, what are some of the options that they have for being able to populate the data warehouse that they set up with their data and some of the overall workflow of then allocating and scaling the compute capacity for being able to run analyses and maybe some of the options for cost optimization to ensure that they don't leave those clusters running when they're not actually being used?
Kent Graziano
0:35:54
Yeah, I'll start with that last one first, because that's the easiest one. When you create a virtual warehouse on Snowflake, we have an auto-suspend and an auto-resume feature, because it is pay as you go. What we've discovered over the years is that many workloads do not need to be running 24/7/365. In fact, in business intelligence and dashboarding and all that, in many cases people come in, fire a query, get the results back, and they're good for the day, really. So we have this auto-suspend feature where the customer controls it with a parameter to say, if the compute is not being used for X number of seconds, minutes, hours, up to them, it automatically shuts down. And then with the auto-resume feature, the next query that comes in to hit that cluster automatically turns it back on. So that's a great cost control, and people are finding that they're certainly saving a lot of money that way, because we bill on a per-second basis: if something runs for five seconds you pay for five seconds, if it runs for five minutes you pay for five minutes, if it runs for an hour you pay for an hour. One of the other scaling features is that we do have the ability to dynamically resize. So say you have a use case that normally runs for an hour, and you're running it on a small virtual warehouse, which is two nodes, and you need to get it done in 15 minutes. Well, then you alter the warehouse to resize it to a large, and now it's running on 16 nodes and it's going to run in 15 minutes instead of an hour, but it costs the same because it's per second, per node. So that's a really cool feature that we have that people do use on a fairly regular basis. On the loading side, you know, how do we get our data into Snowflake? That's obviously a very common question: how do we get data into the cloud if all of our data is in data centers? We have a very open ecosystem of connectors. We started off with a native Python connector, an ODBC connector, a JDBC connector. We even have an optimized Spark connector, since people were writing their ETL using Spark, which many organizations have been doing.
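A minimal sketch of the auto-suspend and auto-resume controls Kent just described; the warehouse name and the five-minute threshold are arbitrary examples.

```sql
-- Suspend the warehouse after 5 idle minutes and wake it on the next query.
CREATE WAREHOUSE dashboards_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 300     -- seconds of inactivity before shutting down
  AUTO_RESUME    = TRUE;   -- the next incoming query restarts it automatically
```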
0:38:34
All the traditional ETL and ELT vendors have connectors as well, so if you're currently using SSIS, or Talend, or Informatica, you can repoint that to Snowflake and continue to use the same flow that you're using today. Then, of course, there are newer ones like Fivetran and Matillion, tools that were more born in the cloud themselves, and you have the option to use those. In fact, we even have a Partner Connect feature: there's a dashboard when you log into Snowflake where you can sign up for free trials for some of these tools. I've heard from customers doing a POC with us that they got access to one of these tools, went in, and within 30 minutes they were up and running, provisioned with one of our partner tools and loading data from their old database into Snowflake. So there's any number of options. You can use Kafka; if you happen to be running in a Kafka environment, we just released our Kafka connector using Confluent, so people are able to run data from Kafka straight into Snowflake. We also have, I'll say, the traditional methods: you can upload your data into a blob storage bucket and then copy it into a table in Snowflake. We have our own ingestion engine called Snowpipe that allows us to take data that is dumped into a blob storage bucket and automatically pick that data up and load it into tables in Snowflake. And there's even a button on our web UI where you can load data in, specify the table you want to load the data into, specify the file you want to pull it from, which could be coming off your laptop, and more or less manually load it through the web UI. So there's a very large variety of options for customers to choose from in how they're going to load their data into Snowflake.
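To ground the blob-storage and Snowpipe paths Kent lists, here is a hedged sketch; the bucket, stage, table, and credential placeholders are hypothetical.

```sql
-- Stage pointing at a cloud storage bucket (credentials elided on purpose).
CREATE STAGE events_stage
  URL = 's3://example-bucket/events/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');

-- Bulk load JSON files from the stage into the VARIANT column.
COPY INTO raw_events (src_file, payload)
FROM (SELECT METADATA$FILENAME, $1 FROM @events_stage)
FILE_FORMAT = (TYPE = 'JSON');

-- Snowpipe wraps the same COPY so newly arriving files load automatically.
CREATE PIPE events_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw_events (src_file, payload)
  FROM (SELECT METADATA$FILENAME, $1 FROM @events_stage)
  FILE_FORMAT = (TYPE = 'JSON');
```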
Tobias Macey
0:40:56
And another one of the interesting features that's enabled by your architecture is the data sharing capability, for being able to say that this data set is available to somebody else who's running a virtual data warehouse. That opens up the possibility of things like data brokerages, or allowing a certain subset of data to be shared between different business units or organizations. And I'm wondering what you have seen as some of the most interesting or unexpected uses of that capability.
Kent Graziano
0:41:24
Yeah, that's pretty much everybody's favorite feature these days, data sharing. Any organization now can become a data provider and effectively build what I refer to in my architectural framework as a curated data mart, grant access to that data mart to any other Snowflake account holder, and it appears to them as a read-only database. That has evolved now into the Snowflake Data Exchange, which is available in public preview in one of our regions, where customers have signed up and are actually making data available. Weather data, for example: there's a weather data provider with a tile on there that you can sign up for to get weather data and access to it, to augment the analytics in your data warehouse in Snowflake. And then just this week at AWS re:Invent we announced the private data exchange, where we're now going to be allowing customers, internally within their own organization rather than on a public data exchange, to define these data sets in a much more, I'll say, user-friendly interface, and specify other organizations within their company, or potentially specific business partners, that they want to allow access to this data, in a very resilient and very secure manner. One of the most interesting usages of this was very early on, with one of our customers that had a partnership with another Snowflake customer. They provided analytics on YouTube videos, so they were collecting all of that sort of data and sharing it over to the other Snowflake customer, who then did augmented analytics on it with more consumer-related data that they had. So they were effectively joining that data to the YouTube analytics and providing a different level of analytics, and then in turn sharing that data back to the original data provider. Effectively, the raw data is coming from one organization and being shared to a partner organization, who then augments it with their own data to get a more sophisticated data set and shares it back to the original provider. And then the really interesting thing about it was when the original consumer turned around and bought the provider, and it actually became one company. The amount of money that they saved by not having to do manual transfers of this data was incredible, and their return on investment was quite amazing to see, because it eliminated all of that. Even before the acquisition, they were dropping files into a secure FTP, and then the other organization had to have an ETL process to ingest it into their data warehouse; they were doing their analytics on it and augmenting it, and then having to spit that data back out to flat files on a secure FTP site, and then the whole process would start over again. It was quite fascinating to see that particular use case, because I call it bidirectional data sharing, and it's really interesting to see what people can do with these features. In many cases, they may be use cases that our founders might not have even thought of when they first started. But it was all a result of this separation of compute from storage: because the storage is centralized, separate from the compute, that's what allows us to do data sharing, and to create this logical share container on the storage layer and have it accessible by compute that's in a completely different Snowflake account.
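Here is a hedged sketch of what the provider side of a share can look like; the database, table, and account identifiers are invented for illustration.

```sql
-- Provider side: expose a curated table to another Snowflake account, read-only, no copies.
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db                       TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.curated               TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.curated.daily_revenue TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = partner_account;

-- Consumer side: the partner mounts the share as a read-only database.
-- CREATE DATABASE partner_sales FROM SHARE provider_account.sales_share;
```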
Tobias Macey
0:45:35
And what are some of the other features or use cases for Snowflake that are not as well known or as well publicized as this data sharing capability, which you think users should know about and would be able to benefit from?
Kent Graziano
0:45:49
Well, there are a couple that I think the DBAs really love. There's a feature called UNDROP that allows you to instantly recover anything from a table to a schema to an entire database. So the classic Monday morning oops: somebody comes in, thinks they're logged into development, but they're logged into production, and they drop a set of tables because they're about to do the next iteration, and then realize, wait a minute, I was in production. They simply issue a command, UNDROP TABLE or UNDROP SCHEMA, whatever it was, and it's instantaneously back. That's, I'll say, a side effect of our feature called Time Travel, which allows you to query your data as it was at a previous point in time. By default, all of our customers get 24 hours of time travel, so if you're running an ETL process in the morning and later in the afternoon you discover that, hey, there's something wrong with our process, they can very easily roll it back; they can take a look at what it looked like before the process ran. And you can go all the way up to 90 days with it, and there's literally no configuration other than setting the parameter. Because of that feature, which is what allows people to do things like UNDROP, if you've got a 90-day time travel window you could actually recover something that you dropped nine days ago, instantaneously. That's a really cool feature. One of the other ones, which I don't think people know a lot about, is that we actually do result set caching. When you run a query and it produces a result set, that data set is actually cached for 24 hours, and you can go into the history tab in the Snowflake UI, review the query that you ran 23 and a half hours ago, and actually take a look at that data set without having to re-execute the query, and there's no cost involved in that. From a development perspective, I've found that very useful. Classic scenario: it's five o'clock, I'm working my way through building out a particular SQL query, I go home, and I come back the next morning trying to remember where I was. Instead of having to rerun the last query that I ran before I went home last night, I simply go back and look at the result cache, I can see the data set, I can see the query, and then I can continue on from there without having the cost of re-executing the query the next morning.
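A quick sketch of the recovery features Kent walks through; the table name, offset, and query ID placeholder are hypothetical.

```sql
-- The Monday morning oops, and its one-line fix.
DROP TABLE orders;
UNDROP TABLE orders;                        -- instantly restored from time travel

-- Query the table as it looked three hours ago (offset is in seconds).
SELECT * FROM orders AT(OFFSET => -3*60*60);

-- Or clone the table as it was just before a known-bad statement ran.
CREATE TABLE orders_before_bad_load CLONE orders
  BEFORE(STATEMENT => '<query-id-of-the-bad-load>');
```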
Tobias Macey
0:48:34
And so, with all of the different capabilities and flexibility and benefits that Snowflake can provide, it's easy to assume that it's the right choice for every situation, but I'm wondering what you have seen as some of the cases when Snowflake is the wrong choice.
Kent Graziano
0:48:49
Well, one of the things that people used to ask me when I first started, because we always said we were a data warehouse, is, well, that sounds like it's a database. And yes, Snowflake is indeed a database, and now we say Snowflake is really a fairly robust data platform. The one thing you absolutely would not want to do with Snowflake is an OLTP application where you're doing singleton inserts and queries, because it is an MPP system. There are eight threads on our smallest virtual warehouse, and you're certainly going to be grossly underutilizing it if you're doing singleton inserts. So any sort of single-row type of application is probably not the best fit. Micro-batches, continuous feeds, things like that are okay, though, and we are working, as part of one of our enhancements, on improving our ability to handle low-latency type requests. But you really wouldn't want to build, say, an order entry system on it. Even though you could use our Node.js connector and build a very fancy web UI order entry system from scratch, that would just not be the right thing to do with Snowflake.
Tobias Macey
0:50:12
And what are some of the plans for the future of SnowflakeDB, either from the technical or the business side?
Kent Graziano
0:50:18
Well, on the business side, the whole business continuity story and global Snowflake is very big, where we are moving forward with our cross-cloud replication. One of the things on the roadmap will be an automatic sort of failover: if your primary Snowflake account fails for some reason and the service goes down, or the telco lines go out for a particular availability zone, it will automatically fail over, and automatically fail over the applications pointing to Snowflake. That is on the roadmap. It's part of a thing we call organizations, which is going to allow you to have basically one organization with multiple Snowflake accounts under it that are all very well connected. Another is our geospatial support, so we're going to be rolling out support for geospatial data types, including GeoJSON, and you can imagine there are certain industries where that's going to be a really big hit, so I know that's coming down the pike. And then with all the regulations and concerns over privacy, we are going to be adding column-level access controls, including dynamic data masking. That's a feature that's been requested particularly in the healthcare and financial industry space: to be able to do it not just at a row level but at a column level, and to incorporate actual data masking functions into Snowflake directly.
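Since Kent describes column-level masking as roadmap work at the time of this conversation, the following is only an illustrative sketch of what a dynamic masking policy can look like; the policy, table, and role names are invented.

```sql
-- Mask a sensitive column for everyone except a privileged role.
CREATE MASKING POLICY mask_ssn AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('COMPLIANCE') THEN val ELSE '***-**-****' END;

ALTER TABLE patients MODIFY COLUMN ssn SET MASKING POLICY mask_ssn;
```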
Tobias Macey
0:52:08
And one thing that we didn't really touch on yet is the specific SQL dialect that Snowflake is using. I'm wondering how closely it sticks to the ANSI standard, and what are some of the interesting extensions or built-in functions that it provides?
Kent Graziano
0:52:25
Yeah, so it is ANSI-standard SQL. The extensions I think I mentioned a little bit already, with the extensions we have for accessing keys within JSON and semi-structured data as well as XML, so there are some extensions there. We have a very extensive SQL function library, which people can check out in our documentation; you can just go to docs.snowflake.com and look up the SQL library. We have a full set of windowing functions and standard SQL statistical functions, things of that nature, already in there. We also have the ability for customers to write user-defined functions, and recently we've added support for stored procedures. Kind of back to our ELT conversation a little earlier, we recently introduced a feature called streams and tasks that can take advantage of stored procedures to do some change data capture within the Snowflake engine. So as new data is loaded, you can create a stream object on top of the table to look at things that have changed, and then launch a task to move the data further downstream. So we're really working on expanding what I'll call true ELT capabilities within the Snowflake engine itself, but with all standard SQL functions to do the transformations and calculations that people have typically needed to do, even to the point of doing some data science type calculations. One of our customers built a data science application completely using Snowflake SQL, and the data scientist said everything he needed was in the library to do the statistical analysis he was trying to achieve.
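A hedged sketch of the streams-and-tasks pattern described here; the stream, task, warehouse, and target table names are illustrative.

```sql
-- Track changes arriving in the raw table.
CREATE STREAM raw_events_stream ON TABLE raw_events;

-- Periodically push new rows downstream whenever the stream has data.
CREATE TASK promote_events_task
  WAREHOUSE = elt_wh
  SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
AS
  INSERT INTO curated_events (event_type, occurred_at)
  SELECT payload:event.type::STRING, payload:event.occurred::TIMESTAMP
  FROM raw_events_stream;

ALTER TASK promote_events_task RESUME;   -- tasks are created in a suspended state
```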
Tobias Macey
0:54:28
Are there any other aspects of the snowflake platform or the ways that it's being used or the use cases that it enables that we didn't discuss yet that you'd like to cover before we close out the show?
Kent Graziano
0:54:39
Wow, it's being used in pretty much every vertical. We've got folks that are using Snowflake now for security analytics, and we have an open source system that we've put out called SnowAlert. That's a really interesting use case, where people are even using Snowflake to recreate some of the functionality that they were previously using things like Splunk for. For marketing analytics, we have quite a few customers doing marketing analytics using Snowflake. And then certainly healthcare, as I've already mentioned; we have well over 100 healthcare-related companies doing healthcare analytics. Some companies are actually building analytic applications on top of Snowflake, so they're providing an analytic service to their customers that's powered by Snowflake under the covers, and we're starting to see more and more of that sort of thing, in many cases in combination with data sharing. So, lots of interesting applications. In this whole hybrid space, being able to do your data lake and your data warehouse all in one platform is certainly a growing use case, and with the addition of streams and tasks and our ELT capabilities, I expect to see that growing even more over the next couple of years.
Tobias Macey
0:56:03
And so for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Kent Graziano
0:56:19
There's a lot of tooling out there. The biggest question I keep getting is about things like data catalogs for large enterprises, and there is some tooling out there for that; we have a number of partner companies that specialize in data catalogs. But honestly, I see more of a knowledge and process gap than a tooling gap. It's just people understanding the need for data governance, and some of the best practices around data management that are necessary to make effective use of all these really large data sets that people are starting to deal with. It ends up being more of a process than a technology question right now. The only thing I'll say on the Snowflake side that we don't handle at this point is things like images and sound, you know, MP3s, which is closer to what people would now think of as truly unstructured data, though in reality they all have some amount of structure in them. So that's a little bit of a gap that I think we have right now, that I know we're looking at, and we have customers asking us about how we can integrate some of the use cases around that as well.
Tobias Macey
0:57:41
Well, thank you very much for taking the time today to join me and share your expertise and experience of working with Snowflake. It's definitely an interesting platform, and one that has been gaining a huge amount of popularity and mindshare, so it's great to see the directions that it's going and some of the capabilities that it provides. Thank you for all of your time, and I hope you enjoy the rest of your day.
Kent Graziano
0:58:03
Thank you very much.
Tobias Macey
0:58:10
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.