SnowflakeDB: The Data Warehouse Built For The Cloud - Episode 110

Summary

Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media and the Python Software Foundation. Upcoming events include the Software Architecture Conference in NYC and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what SnowflakeDB is for anyone who isn’t familiar with it?
    • How does it compare to the other available platforms for data warehousing?
    • How does it differ from traditional data warehouses?
      • How does the performance and flexibility affect the data modeling requirements?
  • Snowflake is one of the data stores that is enabling the shift from an ETL to an ELT workflow. What are the features that allow for that approach and what are some of the challenges that it introduces?
  • Can you describe how the platform is architected and some of the ways that it has evolved as it has grown in popularity?
    • What are some of the current limitations that you are struggling with?
  • For someone getting started with Snowflake what is involved with loading data into the platform?
    • What is the workflow for allocating and scaling compute capacity and running analyses?
  • One of the interesting features enabled by your architecture is data sharing. What are some of the most interesting or unexpected uses of that capability that you have seen?
  • What are some other features or use cases for Snowflake that are not as well known or publicized which you think users should know about?
  • When is SnowflakeDB the wrong choice?
  • What are some of the plans for the future of SnowflakeDB?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse. So Kent, can you start by introducing yourself?
Kent Graziano
0:01:39
Sure. Thanks for having me, Tobias. I am the Chief Technical Evangelist for Snowflake. I've been with the company for just a little over four years now; I joined when there were only about 100 people in the company. Career-wise, I have been in the data space for nearly 30 years. I've been doing data warehousing for over 25 of those, and co-authored books with Bill Inmon and Dan Linstedt. And I've done everything from being an independent consultant to being the director of a data warehousing team for a public school system, as well as a government contractor, and I've worked in the healthcare space as well. So I've seen the ups and downs of the industry over the last couple of decades, and I'm still excited to be in it.
Tobias Macey
0:02:33
And do you remember how you first got involved in the area of data management?
Kent Graziano
0:02:37
Yeah, actually, I do. I was on a government contract in Denver, working at the Denver Federal Center with the Department of the Interior. I got into a job initially as a programmer, and had the manager come to me and say, hey, I just came back from this class at IBM on relational database design, does that sound like anything that might be of interest to you? And then he proceeded to explain third normal form to me, and I was basically hooked at that point. Within a couple of months the project led to me getting my hands on Oracle version 5.1a, installing it on a PC, and building out the very first relational database system for that particular government agency, which was simply combining three mailing lists that were being managed in a COBOL system into a single database. That was, as they say, the beginning of the whole thing.
Tobias Macey
0:03:44
And so, as you mentioned, you're now the Chief Technical Evangelist for Snowflake. I'm wondering if you can just start by explaining a bit about what the SnowflakeDB platform is for anybody who isn't familiar with it?
Kent Graziano
0:03:55
Sure. So Snowflake is a cloud-based data platform. It started off as what I used to call a data warehouse as a service, so it was built in the cloud, for the cloud, to allow people to load and analyze data at massive speeds, taking advantage of the native architecture that the cloud brings us.
Tobias Macey
0:04:27
And I know that there are a lot of architectural changes that come about as a result of the fact that it's designed specifically for cloud-native environments. So I'm wondering if you can compare it a bit with some of the other available platforms for data warehousing, and maybe highlight some of the differences in terms of the operational and usage characteristics from traditional data warehouses?
Kent Graziano
0:04:52
Sure. In the data warehousing world, over the last couple of decades we have evolved from basic relational database systems to MPP, massively parallel processing, systems with either a shared-nothing architecture or a shared-disk architecture. What the founders of Snowflake did, because of the nature of the cloud versus an on-premises sort of system, is invent a completely new architecture called the multi-cluster shared data architecture, which separates the compute power from the storage, so that we can independently and elastically scale the storage and the compute. In the previous architectures, basically, if you wanted more storage, you had to get a bigger box, and it had more compute for the most part; if you wanted more compute, many times you were forced into buying, again, a bigger box that had more storage that you might not need. You also found that you were generally bound by the compute power of the box that you bought, say it's 16 CPUs. Well, 16 CPUs may have been great when you started your data warehouse and you had 20 users wanting to process data. If you had the great opportunity to build a wildly successful platform and you suddenly find yourself with hundreds or thousands of users, they're now bumping into each other, and they're having to compete for that shared resource. With the architecture that Snowflake has developed, you can actually create independent compute clusters for these different workloads, which allows us to really scale to a near-infinite level for the number of users and number of processes that can be running, and at the same time make sure that they are not contending with each other for limited resources. It really allows people to be much more agile in the development of their ecosystem, as they can start very small and grow very large without having to go out and purchase another system and do a migration. It's all dynamic and elastic, on demand, in the cloud.
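To make the idea of independent compute clusters per workload concrete, here is a minimal sketch; the warehouse names and sizes are illustrative examples, not anything from the episode.

```sql
-- Each workload gets its own compute cluster; all of them read the same shared storage.
CREATE WAREHOUSE elt_wh WAREHOUSE_SIZE = 'MEDIUM';   -- batch loading and transformation
CREATE WAREHOUSE bi_wh  WAREHOUSE_SIZE = 'SMALL';    -- dashboards and ad hoc queries
CREATE WAREHOUSE ds_wh  WAREHOUSE_SIZE = 'LARGE';    -- heavier data science workloads
-- Because the clusters are independent, a long-running load on elt_wh
-- never competes with the analysts querying through bi_wh.
```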
Tobias Macey
0:07:16
And then there are a number of other competitors that are targeting the cloud environment, most notably things like BigQuery or Redshift, and recently Redshift Spectrum. I'm curious if you can give a bit of a comparison of the characteristics of Snowflake as it relates to some of the other cloud-native implementations that people might be familiar with, and then also, on the more data lake side of things, systems such as Presto, or some of the SQL-on-Hadoop or SQL-on-unstructured-data solutions?
Kent Graziano
0:07:48
Sure, yeah. Even some of the systems that are calling themselves cloud native are all based on the previous architectures that I talked about, whether it was a shared-nothing or shared-disk architecture. And while they may be in the cloud, for the most part they are slight evolutions of the original architectures, and so they don't offer this dynamic scalability under the control of the customer, on demand. You have things like query engines that can scale quite large, but in general the customer doesn't have a lot of control over that, which means they can throw a really large job at it and it will do a great job, but they don't necessarily have control over how much resource is being thrown at it, which means that in the end they don't have a lot of control over how much that may cost them. Some of the other ones are really ports of the older architectures into the cloud, so you're looking at things that are effectively VMs, and the architecture is still the same, where the compute and the storage are tightly coupled: if you need more storage you've got to buy more compute, and if you need more compute you've got to buy more storage. Now, to be fair, many of our competitors are obviously evolving and adding things to their offerings over time, and we're starting to see that, but in the end it still ends up being a fairly complicated engineering feat for the folks who are managing the system. And that's one of the things that really differentiates us: the, I'll say, lack of management that you need to do. What a traditional database administrator may have had to do with performance tuning and backup and recovery and things like that, we've taken care of all of that for you, where in many of the other cloud-based data warehouse systems the DBA still has that more traditional role and is having to manage things like distribution keys and vacuuming, as well as the sizing and forecasting of the utilization. That really does make a big difference. On the other side of the house, when I start talking about data lakes and semi-structured data, one of the things that we did in Snowflake is we invented a new data type called VARIANT that allows you to load semi-structured data into a column in a table, and then access it natively with some minor extensions to SQL. So this saves people from having to parse it out and basically ETL the data into traditional tables and columns. In comparison to things like Presto and Hive, and some of the other tools out there that evolved out of the Hadoop ecosystem, those really are SQL interfaces, if you will, on top of a file system, and so you do not have some of the traditional database features such as transaction consistency and ACID compliance. Those aren't necessarily there, depending on which rev you're using; again, those are all evolving. But what we've seen over the last 10 years is what started out as very, I'll say, minimalist SQL, because we discovered everybody really understands SQL, and they don't necessarily understand Java and MapReduce, so a SQL interface had to be built to make those Hadoop-style systems work. And today with the cloud, some of it is just blob storage, whether it's S3 on Amazon or Azure Blob Storage; we're still dealing with more or less a file system with a SQL interface built on top of it.
So this does require a lot more care and feeding, and puts limitations on you as to what you can do from a traditional database perspective, which we've got all built into the Snowflake platform already.
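As a rough illustration of the VARIANT data type Kent describes, here is a minimal sketch; the table, columns, and JSON paths are hypothetical examples rather than anything from the episode.

```sql
-- Land whole JSON documents in a single VARIANT column, no up-front parsing.
CREATE TABLE raw_events (
    load_ts  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    src_file STRING,
    payload  VARIANT
);

-- Query nested attributes natively with path notation and casts.
SELECT
    payload:user.id::STRING           AS user_id,
    payload:event.type::STRING        AS event_type,
    payload:event.occurred::TIMESTAMP AS occurred_at
FROM raw_events;
```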
Tobias Macey
0:11:52
And one of the things that you mentioned in there is this VARIANT data type for being able to load semi-structured data directly into Snowflake. And I know that that's one of the attributes that has led to the current shift from ETL to ELT, where you're doing the transformations within the data warehouse itself rather than up front, because traditionally there was a lot of planning and rigor that had to go into loading the data into the data warehouse to make it accessible for other people. So I'm wondering how the performance characteristics and flexibility of Snowflake, and the availability of this VARIANT data type and some of the schematic flexibility, play into the overall data modeling requirements for using Snowflake and some of the shift towards this ELT workflow.
Kent Graziano
0:12:40
It's been very interesting for me, because I come from a traditional relational background with third normal form modeling as well as dimensional modeling and Data Vault modeling in the data warehouse world. So looking at Snowflake and the features that we have, even just looking at the semi-structured data, whether it's JSON, Avro, Parquet, things like that, a lot of our customers have found they can load that into a column. So effectively you have a table now where there are maybe a couple of metadata columns and a variant column, and then in each single row in that table, in that variant, you'll have an entire JSON document that might have arrays and nested arrays and all sorts of really interesting structures within, we'll call it, the schema of the JSON document. And our customers are able to build views on that using our JSON SQL syntax and make it look like a dimensional model. I've found this incredibly fascinating and also incredibly powerful, in that we're now able to do things like load web logs into a database table but make them accessible to, I'll say, your average business analyst in a dimensional format. So they can actually do analytics on this type of data now, using Tableau, using Power BI, using tools that they're very familiar with, and they don't even know that it's JSON data under the covers. And so with the combination of being able to load that type of data and the SQL extensions to access it, we've actually eliminated the T now in the ELT, extract, load, transform: the transform part is now a SQL view.
0:14:37
And it's transforming the structure into something that a business analyst understands. We're able to then load data much faster and really reduce the latency and the time to value for that data, and the performance optimizations that are built into Snowflake, combined with our ability to do these independent compute clusters, which we call virtual warehouses, allow us to get the performance out of that. So there's literally no indexing, there's no manual partitioning, there are no distribution keys, and so DBAs are not having to spend, I'll say, nights and weekends trying to make this stuff perform. We're able to just build views on it, expose it to Tableau, and the business analysts now have rapid access to all of this data. Additionally, we can allow them to join that semi-structured data to structured data that might have been loaded, say, from a CRM system in a more traditional schema. From a data modeling perspective, there are two factors here with Snowflake. One, in the scenarios we're talking about right now with the semi-structured data, we're doing almost virtual modeling, right? The modeling approaches, whether it's 3NF, Data Vault, or dimensional, can all be done with views. So it's more about what is the right structure to represent the data in for the business use case, and that also means that you can have one set of data and have multiple representations of it serving different use cases. This really gets us down to that concept of the single source of truth that we've been trying to reach in the data warehousing world for several decades. The other aspect of modeling with Snowflake is that the system was designed to be what we call schema agnostic. Many of the traditional on-premises database systems really preferred a specific modeling technique, whether third normal form or dimensional for the most part, and required specific tuning to make those things work. Snowflake was designed to be performant against any modeling technique, because we wanted to ensure that wherever our customers were coming from in their legacy on-prem world, they wouldn't necessarily have to change their approach to modeling in order to get effective use out of Snowflake. And so that was actually designed in from the ground up. One of the things that I didn't say earlier is that our founders wrote Snowflake completely from the ground up; it's a brand new relational database. So not only did they invent a new architecture to take advantage of the elasticity and the dynamic nature of the cloud, and to eliminate the concurrency and performance issues that people have had with the traditional systems, they were able to do this only by starting from ground zero and not forking some previous code base. It is all new code, resulting in this really dynamic data platform that can be used, as we're talking about, for data lake types of applications as well as traditional analytics applications, and be ANSI SQL compliant as well, at the same time.
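Here is a hedged sketch of the kind of view Kent is describing, presenting nested JSON in a flat, dimensional-looking shape; the view name, paths, and array structure are made-up examples.

```sql
-- A view that makes a nested JSON document look like a flat, analyst-friendly table.
CREATE OR REPLACE VIEW v_page_views AS
SELECT
    r.payload:session.id::STRING    AS session_id,
    r.payload:session.user::STRING  AS user_name,
    p.value:url::STRING             AS page_url,
    p.value:viewed_at::TIMESTAMP    AS viewed_at
FROM raw_events r,
     LATERAL FLATTEN(input => r.payload:session.pages) p;  -- unnest the pages array
```

A BI tool pointed at v_page_views sees ordinary columns and never needs to know the rows are backed by JSON documents.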
Tobias Macey
0:18:12
And there seem to be two driving factors for when somebody might decide between using a data lake versus a data warehouse, one of which being the flexibility in terms of the types of data that you can load in and the options for how to analyze it, which is in large part addressed by the flexibility in the schematic modeling that Snowflake offers, and the other one being cost. So I'm curious what that looks like in terms of a breakdown of the overall spend somebody would be dealing with, particularly on the storage side but also to some extent the compute, of storing everything in an S3 data lake or possibly even HDFS, versus what you'd be storing for the same volume of data in Snowflake, and then the amount of time and compute energy required to be able to derive value from it, or the amount of data copying necessary, particularly in the data lake scenario, to transform it into different forms for running various analyses on it.
Kent Graziano
0:19:15
Yeah, you know, that's a great question. One of the things with Snowflake is that we are MPP style, right, massively parallel processing, using MPP compute nodes in the cloud. But on the storage side, we have actually implemented an advanced columnar compression algorithm as well. So from a pure storage cost perspective, our customers pay basically the same price for storage as you would on the underlying cloud vendor. In the case of AWS we're talking about S3 storage, because under the covers that's what we're using; the difference is we are giving them three to five X compression on that data. So from a pure data lake storage perspective, you can store that data in Snowflake and in all likelihood be paying way less because of the compression that we have. Our terabyte-per-terabyte storage cost is effectively the same as blob storage, except that in our case we're able to compress it, so you're going to get from three to five terabytes compressed into one terabyte of storage. That's a huge cost savings for our customers, and that's why it's become feasible for them to consider putting their data lake in Snowflake as well as their data warehouse and doing all the analytics there. Now, your secondary question is, okay, if we put the data lake and all that data in Snowflake, how much is it costing us, and what does it look like to then transform that into analytic-style schemas? As I mentioned before, in many cases, if we're talking simply about the semi-structured data and even unstructured data, a lot of our customers are just using views, so the cost there ends up being in the compute to execute the query, where in the past you may have had costs for doing actual ETL processing. We do certainly have customers that are doing transformations within Snowflake as well, where it's necessary to apply some advanced business rules, some data cleansing rules, things like that, so you do have that as well. But that cost, in the case of transformations, whether you're talking ETL or ELT, is something they were paying one way or another before already. The advantage again with Snowflake is that our customers are finding our system runs way faster than their prior systems. I know one customer that had a report that ran for 20 hours on their data warehouse appliance, and by simply moving that data and the schema over to Snowflake, it ran in 45 minutes. So they experienced a massive cost reduction looking at the cost of compute on Snowflake in comparison to what that massive data warehouse appliance cost them. And it is a different cost model, for sure. It is a pay-as-you-go model; in the case of Snowflake, we charge per second, per node. So when you create these virtual clusters, we're talking about how many nodes there are: is it a single node, is it two nodes, four nodes, eight nodes? But we charge on a per-second, per-node basis, and generally get linear scalability. So if somebody wants to run something faster, they use a larger virtual warehouse and it'll run in half the time, but because of the pricing model it costs exactly the same amount of money; it just ran faster. And so then we start talking about the value of the time, and what else they can do when they have access to that data so much faster.
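To make the per-second, per-node pricing point concrete, here is a minimal sketch; the warehouse name is hypothetical and the node counts in the comments assume the linear scaling Kent describes.

```sql
-- Double the nodes, roughly halve the wall-clock time, spend the same amount:
--   2 nodes x 3,600 s = 7,200 node-seconds
--   4 nodes x 1,800 s = 7,200 node-seconds
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'MEDIUM';  -- scale up before the big job
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'SMALL';   -- scale back down afterwards
```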
Tobias Macey
0:23:13
And another thing that you mentioned a couple of times is the fact that, because of the ability to create these views on the underlying data structures, you can in many cases just have the data living in one place and you don't have to worry about copying it. So I'm curious how that plays into the requirement for a master data management approach, for ensuring that there are canonical schemas or canonical IDs for users or sales products or things like that, and whether that is still a necessary aspect of data modeling and data management in this Snowflake world.
Kent Graziano
0:23:49
It is, for certain use cases, and it really is about what the business requirements are. If you're talking about a lot of raw JSON web logs and the need to do data science on them and try to derive some insights from that data, then they're pretty much good to go with what we've been describing. If you are looking at building an enterprise data warehouse in a highly regulated industry that needs to be compliant with GDPR and CCPA and some of the other regulations that are out there, then we do have our, I'll say, enterprise customers, the really big organizations, still building canonical schemas and doing more of the traditional transformations and master data management type approaches. I think the beauty of Snowflake is that it allows you to do either; you're not locked in, and you have the flexibility based on the requirements of your business and your approach and what you're trying to achieve. So we do have people doing both, for sure, who are porting over things that are what I call the highly integrated enterprise data warehouse, where it does require an integration layer using something like a Data Vault modeling methodology, where you will have stage tables and you will have a core granular data warehouse that is aligning the business keys and making sure that when the reporting comes out we have a common, agreed-upon representation for a product or a customer. So all of that still does happen, and again, it just varies from customer to customer as to what their business requirements are.
Tobias Macey
0:25:41
And so you mentioned that the high-level view of how Snowflake is implemented is that there's the underlying data storage in the cloud object store, and then you run the compute on top of that in a sort of virtualized way, and in a way that gives you some segmentation in terms of who can access what. But I'm wondering if you can dig a bit more into how the Snowflake platform is architected, and some of the ways that it has evolved over the years.
Kent Graziano
0:26:10
Sure. So again, the storage layer, as we were saying, under the covers, whether it's on AWS or Google, is using the blob storage native to that cloud, and we are writing our own proprietary file structures into that storage. As a Snowflake customer and user, you don't see it as S3 storage; you don't see any storage, really. I mean, you look at a chart and see how much storage you're using, how many terabytes of data you're storing, but you can't access or see these files. It's all something we've built, with a patented file format and structure and metadata around it. That's the storage under the covers. The compute layer is then using the compute nodes, so in the case of Amazon we're talking about EC2 nodes under the covers, complete with their SSD, and that becomes our dynamic memory, if you will, along with the number of threads on the compute cluster that someone has allocated. Holding that all together is what we call the Global Services layer, and that's really the brains of Snowflake. That is where the metadata engine is that's tracking where all the files are, tracking the users and the roles, all the role-based access control. The security is all tracked in there as to who has access to what compute resources, who has privileges to create new compute resources, who has privileges to create new tables and load data, all of the standard things that you would need in a database from a role-based access control perspective. The Global Services layer also manages the transaction consistency, the ACID compliance, and manages all the security from an encryption perspective. Our underlying data is always encrypted; it's 256-bit encryption, in motion and at rest. So as you are uploading data into Snowflake, that data is encrypted on the client side before it comes over the wire and into the Snowflake ecosystem, and then it is encrypted at rest. We use a fairly robust hierarchical encryption scheme to make sure this stuff is all secured, complete with rotating keys, because one of the biggest concerns in the cloud was always, you know, is our data secure? I worked in many traditional database environments over my career where the data was never encrypted and everyone was counting on the firewall to keep the data secure, when in fact people were able to get through those firewalls, read the underlying files, and actually glean data out of them. Snowflake was built to make sure that could never happen, and because of that we are HIPAA compliant, we're PCI compliant, and we have SOC 2 Type 2 and SOC 2 Type 1 certifications, along with a number of other security certifications, to ensure that our customers' data is always secure and there's no way anyone can get to it. So it's really the three layers: the Global Services layer, which is the brains; your compute layer, which is dynamic, where you can create as many of these independent compute clusters as you want; and then at the bottom there is the common storage layer, allowing us to have one set of data and, again, not have to copy it between silos and data marts, and have that one set of data accessed by all these different compute nodes.
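As a small illustration of the role-based access control side of the Global Services layer, here is a hedged sketch; the role, warehouse, database, and user names are all made up.

```sql
-- Grant an analyst role just enough to query a curated schema on one warehouse.
CREATE ROLE analyst;
GRANT USAGE  ON WAREHOUSE bi_wh                      TO ROLE analyst;
GRANT USAGE  ON DATABASE  sales_db                   TO ROLE analyst;
GRANT USAGE  ON SCHEMA    sales_db.marts             TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.marts  TO ROLE analyst;
GRANT ROLE analyst TO USER jane_doe;
```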
Tobias Macey
0:30:05
And my understanding from previous conversations that I've had, one of which was actually on this podcast, is that the actual metadata layer for Snowflake is implemented on top of FoundationDB, so that you have one location for being able to identify where all of these files are located in the broader S3 storage, rather than having to query the S3 API and deal with those latencies when seeking through information as you're trying to run a query. I'm wondering if you can talk to either that or any of the other performance optimizations that are built into the system to ensure that it's operating at the sort of maximum performance you're looking for, and to be able to provide that reduced time to value for your customers.
Kent Graziano
0:30:51
Yeah, that's absolutely correct; as you were told before, under the covers our metadata engine is based on FoundationDB. We retain, I'll say, the addresses of all of the data in there, so that when someone is querying the data, it goes in there and looks for the data depending on the nature of the query and the names of the tables. All of that's there, and it has pointers directly to our underlying file system to find that data as quickly as possible. In fact, along with all the pointers, part of it is keeping the range of the data, to know that from A to Z of this particular column you can find it in this set of files under the covers. So we have that sort of direct pointer there, and when people are writing queries, based on the predicates and the join conditions, we have a lot of metadata information that allows us to go directly to the files where that data is really stored. Again, to a customer it's a table, right? But under the covers those tables are made up of a lot of very small files, which we call micro partitions, and one of our primary methods of performance is called partition pruning. It's based on the ability of the system to look at the query, go to the metadata, and then go directly to the set of files that contain the data that the query is asking for, so we're not having to do what would have been called, I'll say in the old database world, a full table scan. Obviously, if somebody wants to do an aggregation on a trillion-row table, well, it's going to do a full table scan. But if you're doing something more like a dimensional analysis, and we want to see all the sales for product X, for region Y, for sales rep Kent, we're going to be able to hone in on that data set very quickly and only return that set, even if we're talking about multi-terabyte tables, and even into the petabyte range.
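For illustration, here is a query shaped like the one Kent describes; the table and columns are hypothetical, and the comments summarize the pruning behavior he outlines rather than anything specific to this example.

```sql
-- Snowflake keeps min/max metadata per micro-partition, so selective predicates
-- like these let the engine skip most of the underlying files instead of scanning them.
SELECT SUM(amount) AS total_sales
FROM sales                                   -- imagine a multi-terabyte table
WHERE product_id = 'X'
  AND region     = 'Y'
  AND rep_name   = 'Kent';
```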
Tobias Macey
0:33:10
And then in terms of the overall system architecture and the implementation details, I'm wondering if there are any sort of edge cases or limitations that you're dealing with or any of the specific challenges for being able to design and implement this across multiple different cloud vendors?
Kent Graziano
0:33:28
Oh yeah, that's interesting; that gets into the fact that we are cloud agnostic. In the experience that I've had with Snowflake so far, there doesn't seem to be anything that we can do on one that we can't do on another. And that was, again, the goal of the founders: they developed Snowflake with this Global Services wrapper around it, and we're really writing to the API of the underlying cloud vendors, so that we can ensure that whether you choose an AWS deployment or an Azure deployment or a Google deployment, your Snowflake experience is exactly the same on all three. And one of the things that is coming, that we just released earlier this year, is the ability to do not only cross-region replication within a cloud provider, but also cross-cloud replication. So you could actually be running Snowflake on AWS and replicating it to Snowflake on Azure. In some cases our customers want to do that for disaster recovery and business continuity; in other cases it may be for latency reasons; and in some cases it may be for business reasons, where there are certain data sets that, for whatever reason, they only want in the Azure space, and other data sets they may need to have in the AWS space for various business reasons. So we've done quite well, because the system was designed from the get-go not to be single cloud; it was designed to be cloud agnostic. And so we have not had these issues where, hey, this feature works great in AWS but it doesn't work in Azure.
Tobias Macey
0:35:24
And then for people who are interested in getting started with using snowflake and loading their data into it, what are some of the options that they have for being able to populate the data warehouse that they set up with their data and some of the overall workflow of then allocating and scaling the compute capacity for being able to run analyses and maybe some of the options for cost optimization to ensure that they don't leave those clusters running when they're not actually being used?
Kent Graziano
0:35:54
Yeah, I'll start with that last one first, because that's the easiest one. When you create a virtual warehouse on Snowflake, we have an auto-suspend and an auto-resume feature, because it is pay as you go. What we've discovered over the years is that many workloads do not need to be running 24/7/365. In fact, in business intelligence and dashboarding and all that, in many cases people come in, fire a query, get the results back, and they're good for the day, really. So we have this auto-suspend feature where the customer controls it with a parameter to say, if the compute is not being used for X number of seconds, minutes, hours, up to them, it automatically shuts down. And then with the auto-resume feature, the next query that comes in to hit that cluster automatically turns it back on. So that's a great cost control, and people are finding that they're certainly saving a lot of money that way, because we bill on a per-second basis: if something runs for five seconds you pay for five seconds, if it runs for five minutes you pay for five minutes, if it runs for an hour you pay for an hour. One of the other scaling features is that we do have the ability to dynamically resize. So say you have a use case that normally runs for an hour, and you're running it on a small virtual warehouse, which is two nodes, and you need to get it done in 15 minutes. Well, then you alter the warehouse to resize it to a large, and now it's running on 16 nodes and it's going to run in 15 minutes instead of an hour, but it costs the same because it's per second, per node. So that's a really cool feature that we have that people do use on a fairly regular basis. On the loading side, you know, how do we get our data into Snowflake? That's obviously a very common question: how do we get data into the cloud if all of our data is in data centers? We have a very open ecosystem of connectors. We started off with a native Python connector, an ODBC connector, a JDBC connector. We even have an optimized Spark connector, since people were writing their ETL using Spark, which many organizations have been doing.
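A minimal sketch of the auto-suspend and auto-resume controls Kent just described; the warehouse name and the five-minute threshold are arbitrary examples.

```sql
-- Suspend the warehouse after 5 idle minutes and wake it on the next query.
CREATE WAREHOUSE dashboards_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 300     -- seconds of inactivity before shutting down
  AUTO_RESUME    = TRUE;   -- the next incoming query restarts it automatically
```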
0:38:34
All the traditional ETL and ELT vendors have connectors as well, so if you're currently using SSIS, or Talend, or Informatica, you can repoint that to Snowflake and continue to use the same flow that you're using today. Then, of course, there are newer ones like Fivetran and Matillion, tools that were more born in the cloud themselves, and you have the option to use those. In fact, we even have a Partner Connect feature: there's a dashboard when you log into Snowflake where you can sign up for free trials for some of these tools. I've heard from customers doing a POC with us that they got access to one of these tools, went in, and within 30 minutes they were up and running, provisioned with one of our partner tools and loading data from their old database into Snowflake. So there's any number of options. You can use Kafka; if you happen to be running in a Kafka environment, we just released our Kafka connector using Confluent, so people are able to run data from Kafka straight into Snowflake. We also have, I'll say, the traditional methods: you can upload your data into a blob storage bucket and then copy it into a table in Snowflake. We have our own ingestion engine called Snowpipe that allows us to take data that is dumped into a blob storage bucket and automatically pick that data up and load it into tables in Snowflake. And there's even a button on our web UI where you can load data in, specify the table you want to load the data into, specify the file you want to pull it from, which could be coming off your laptop, and more or less manually load it through the web UI. So there's a very large variety of options for customers to choose from in how they're going to load their data into Snowflake.
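To ground the blob-storage and Snowpipe paths Kent lists, here is a hedged sketch; the bucket, stage, table, and credential placeholders are hypothetical.

```sql
-- Stage pointing at a cloud storage bucket (credentials elided on purpose).
CREATE STAGE events_stage
  URL = 's3://example-bucket/events/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');

-- Bulk load JSON files from the stage into the VARIANT column.
COPY INTO raw_events (src_file, payload)
FROM (SELECT METADATA$FILENAME, $1 FROM @events_stage)
FILE_FORMAT = (TYPE = 'JSON');

-- Snowpipe wraps the same COPY so newly arriving files load automatically.
CREATE PIPE events_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw_events (src_file, payload)
  FROM (SELECT METADATA$FILENAME, $1 FROM @events_stage)
  FILE_FORMAT = (TYPE = 'JSON');
```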
Tobias Macey
0:40:56
And another one of the interesting features that's enabled by your architecture is the data sharing capability, for being able to say that this data set is available to somebody else who's running a virtual data warehouse. That opens up the possibility of things like data brokerages, or allowing a certain subset of data to be shared between different business units or organizations. And I'm wondering what you have seen as some of the most interesting or unexpected uses of that capability.
Kent Graziano
0:41:24
Yeah, that's pretty much everybody's favorite feature these days, data sharing. Any organization now can become a data provider and effectively build what I refer to in my architectural framework as a curated data mart, grant access to that data mart to any other Snowflake account holder, and it appears to them as a read-only database. That has evolved now into the Snowflake Data Exchange, which is available in public preview in one of our regions, where customers have signed up and are actually making data available. Weather data, for example: there's a weather data provider with a tile on there that you can sign up for to get weather data and access to it, to augment the analytics in your data warehouse in Snowflake. And then just this week at AWS re:Invent we announced the private data exchange, where we're now going to be allowing customers, internally within their own organization rather than on a public data exchange, to define these data sets in a much more, I'll say, user-friendly interface, and specify other organizations within their company, or potentially specific business partners, that they want to allow access to this data, in a very resilient and very secure manner. One of the most interesting usages of this was very early on, with one of our customers that had a partnership with another Snowflake customer. They provided analytics on YouTube videos, so they were collecting all of that sort of data and sharing it over to the other Snowflake customer, who then did augmented analytics on it with more consumer-related data that they had. So they were effectively joining that data to the YouTube analytics and providing a different level of analytics, and then in turn sharing that data back to the original data provider. Effectively, the raw data is coming from one organization and being shared to a partner organization, who then augments it with their own data to get a more sophisticated data set and shares it back to the original provider. And then the really interesting thing about it was when the original consumer turned around and bought the provider, and it actually became one company. The amount of money that they saved by not having to do manual transfers of this data was incredible, and their return on investment was quite amazing to see, because it eliminated all of that. Even before the acquisition, they were dropping files into a secure FTP, and then the other organization had to have an ETL process to ingest it into their data warehouse; they were doing their analytics on it and augmenting it, and then having to spit that data back out to flat files on a secure FTP site, and then the whole process would start over again. It was quite fascinating to see that particular use case, because I call it bidirectional data sharing, and it's really interesting to see what people can do with these features. In many cases, they may be use cases that our founders might not have even thought of when they first started. But it was all a result of this separation of compute from storage: because the storage is centralized, separate from the compute, that's what allows us to do data sharing, and to create this logical share container on the storage layer and have it accessible by compute that's in a completely different Snowflake account.
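Here is a hedged sketch of what the provider side of a share can look like; the database, table, and account identifiers are invented for illustration.

```sql
-- Provider side: expose a curated table to another Snowflake account, read-only, no copies.
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db                       TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.curated               TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.curated.daily_revenue TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = partner_account;

-- Consumer side: the partner mounts the share as a read-only database.
-- CREATE DATABASE partner_sales FROM SHARE provider_account.sales_share;
```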
Tobias Macey
0:45:35
And what are some of the other features or use cases for Snowflake that are not as well known or as well publicized as this data sharing capability, which you think users should know about and would be able to benefit from?
Kent Graziano
0:45:49
Well, there are a couple that I think the DBAs really love. There's a feature called UNDROP that allows you to instantly recover anything from a table to a schema to an entire database. So the classic Monday morning oops: somebody comes in, thinks they're logged into development, but they're logged into production, and they drop a set of tables because they're about to do the next iteration, and then realize, wait a minute, I was in production. They simply issue a command, UNDROP TABLE or UNDROP SCHEMA, whatever it was, and it's instantaneously back. That's, I'll say, a side effect of our feature called Time Travel, which allows you to query your data as it was at a previous point in time. By default, all of our customers get 24 hours of time travel, so if you're running an ETL process in the morning and later in the afternoon you discover that, hey, there's something wrong with our process, they can very easily roll it back; they can take a look at what it looked like before the process ran. And you can go all the way up to 90 days with it, and there's literally no configuration other than setting the parameter. Because of that feature, which is what allows people to do things like UNDROP, if you've got a 90-day time travel window you could actually recover something that you dropped nine days ago, instantaneously. That's a really cool feature. One of the other ones, which I don't think people know a lot about, is that we actually do result set caching. When you run a query and it produces a result set, that data set is actually cached for 24 hours, and you can go into the history tab in the Snowflake UI, review the query that you ran 23 and a half hours ago, and actually take a look at that data set without having to re-execute the query, and there's no cost involved in that. From a development perspective, I've found that very useful. Classic scenario: it's five o'clock, I'm working my way through building out a particular SQL query, I go home, and I come back the next morning trying to remember where I was. Instead of having to rerun the last query that I ran before I went home last night, I simply go back and look at the result cache, I can see the data set, I can see the query, and then I can continue on from there without having the cost of re-executing the query the next morning.
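A quick sketch of the recovery features Kent walks through; the table name, offset, and query ID placeholder are hypothetical.

```sql
-- The Monday morning oops, and its one-line fix.
DROP TABLE orders;
UNDROP TABLE orders;                        -- instantly restored from time travel

-- Query the table as it looked three hours ago (offset is in seconds).
SELECT * FROM orders AT(OFFSET => -3*60*60);

-- Or clone the table as it was just before a known-bad statement ran.
CREATE TABLE orders_before_bad_load CLONE orders
  BEFORE(STATEMENT => '<query-id-of-the-bad-load>');
```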
Tobias Macey
0:48:34
And so, with all of the different capabilities and flexibility and benefits that Snowflake can provide, it's easy to assume that it's the right choice for every situation, but I'm wondering what you have seen as some of the cases when Snowflake is the wrong choice.
Kent Graziano
0:48:49
Well, one of the things that people used to ask me when I first started, because we always said we were a data warehouse, is, well, that sounds like it's a database. And yes, Snowflake is indeed a database, and now we say Snowflake is really a fairly robust data platform. The one thing you absolutely would not want to do with Snowflake is an OLTP application where you're doing singleton inserts and queries, because it is an MPP system. There are eight threads on our smallest virtual warehouse, and you're certainly going to be grossly underutilizing it if you're doing singleton inserts. So any sort of single-row type of application is probably not the best fit. Micro-batches, continuous feeds, things like that are okay, though, and we are working, as part of one of our enhancements, on improving our ability to handle low-latency type requests. But you really wouldn't want to build, say, an order entry system on it. Even though you could use our Node.js connector and build a very fancy web UI order entry system from scratch, that would just not be the right thing to do with Snowflake.
Tobias Macey
0:50:12
And what are some of the plans for the future of SnowflakeDB, either from the technical or the business side?
Kent Graziano
0:50:18
Well, on the business side, the whole business continuity story and global Snowflake is very big, where we are moving forward with our cross-cloud replication. One of the things on the roadmap will be an automatic sort of failover: if your primary Snowflake account fails for some reason and the service goes down, or the telco lines go out for a particular availability zone, it will automatically fail over, and automatically fail over the applications pointing to Snowflake. That is on the roadmap. It's part of a thing we call organizations, which is going to allow you to have basically one organization with multiple Snowflake accounts under it that are all very well connected. Another is our geospatial support, so we're going to be rolling out support for geospatial data types, including GeoJSON, and you can imagine there are certain industries where that's going to be a really big hit, so I know that's coming down the pike. And then with all the regulations and concerns over privacy, we are going to be adding column-level access controls, including dynamic data masking. That's a feature that's been requested particularly in the healthcare and financial industry space: to be able to do it not just at a row level but at a column level, and to incorporate actual data masking functions into Snowflake directly.
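Since Kent describes column-level masking as roadmap work at the time of this conversation, the following is only an illustrative sketch of what a dynamic masking policy can look like; the policy, table, and role names are invented.

```sql
-- Mask a sensitive column for everyone except a privileged role.
CREATE MASKING POLICY mask_ssn AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('COMPLIANCE') THEN val ELSE '***-**-****' END;

ALTER TABLE patients MODIFY COLUMN ssn SET MASKING POLICY mask_ssn;
```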
Tobias Macey
0:52:08
And one thing that we didn't really touch on yet is the specific SQL dialect that Snowflake is using. I'm wondering how closely it sticks to the ANSI standard, and what are some of the interesting extensions or built-in functions that it provides?
Kent Graziano
0:52:25
Yeah, so it is ANSI-standard SQL. The extensions I think I mentioned a little bit already, with the extensions we have for accessing keys within JSON and semi-structured data as well as XML, so there are some extensions there. We have a very extensive SQL function library, which people can check out in our documentation; you can just go to docs.snowflake.com and look up the SQL library. We have a full set of windowing functions and standard SQL statistical functions, things of that nature, already in there. We also have the ability for customers to write user-defined functions, and recently we've added support for stored procedures. Kind of back to our ELT conversation a little earlier, we recently introduced a feature called streams and tasks that can take advantage of stored procedures to do some change data capture within the Snowflake engine. So as new data is loaded, you can create a stream object on top of the table to look at things that have changed, and then launch a task to move the data further downstream. So we're really working on expanding what I'll call true ELT capabilities within the Snowflake engine itself, but with all standard SQL functions to do the transformations and calculations that people have typically needed to do, even to the point of doing some data science type calculations. One of our customers built a data science application completely using Snowflake SQL, and the data scientist said everything he needed was in the library to do the statistical analysis he was trying to achieve.
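A hedged sketch of the streams-and-tasks pattern described here; the stream, task, warehouse, and target table names are illustrative.

```sql
-- Track changes arriving in the raw table.
CREATE STREAM raw_events_stream ON TABLE raw_events;

-- Periodically push new rows downstream whenever the stream has data.
CREATE TASK promote_events_task
  WAREHOUSE = elt_wh
  SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
AS
  INSERT INTO curated_events (event_type, occurred_at)
  SELECT payload:event.type::STRING, payload:event.occurred::TIMESTAMP
  FROM raw_events_stream;

ALTER TASK promote_events_task RESUME;   -- tasks are created in a suspended state
```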
Tobias Macey
0:54:28
Are there any other aspects of the snowflake platform or the ways that it's being used or the use cases that it enables that we didn't discuss yet that you'd like to cover before we close out the show?
Kent Graziano
0:54:39
Wow, it's being used in pretty much every vertical. We've got folks that are using Snowflake now for security analytics, and we have an open source system that we've put out called SnowAlert. That's a really interesting use case, where people are even using Snowflake to recreate some of the functionality that they were previously using things like Splunk for. For marketing analytics, we have quite a few customers doing marketing analytics using Snowflake. And then certainly healthcare, as I've already mentioned; we have well over 100 healthcare-related companies doing healthcare analytics. Some companies are actually building analytic applications on top of Snowflake, so they're providing an analytic service to their customers that's powered by Snowflake under the covers, and we're starting to see more and more of that sort of thing, in many cases in combination with data sharing. So, lots of interesting applications. In this whole hybrid space, being able to do your data lake and your data warehouse all in one platform is certainly a growing use case, and with the addition of streams and tasks and our ELT capabilities, I expect to see that growing even more over the next couple of years.
Tobias Macey
0:56:03
And so for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Kent Graziano
0:56:19
There's a lot of tooling out there. The biggest question I keep getting is about things like data catalogs for large enterprises, and there is some tooling out there for that; we have a number of partner companies that specialize in data catalogs. But honestly, I see more of a knowledge and process gap than a tooling gap. It's just people understanding the need for data governance, and some of the best practices around data management that are necessary to make effective use of all these really large data sets that people are starting to deal with. It ends up being more of a process than a technology question right now. The only thing I'll say on the Snowflake side that we don't handle at this point is things like images and sound, you know, MP3s, which is closer to what people would now think of as truly unstructured data, though in reality they all have some amount of structure in them. So that's a little bit of a gap that I think we have right now, that I know we're looking at, and we have customers asking us about how we can integrate some of the use cases around that as well.
Tobias Macey
0:57:41
Well, thank you very much for taking the time today to join me and share your expertise and experience of working with Snowflake. It's definitely an interesting platform, and one that has been gaining a huge amount of popularity and mindshare, so it's great to see the directions that it's going and some of the capabilities that it provides. Thank you for all of your time, and I hope you enjoy the rest of your day.
Kent Graziano
0:58:03
Thank you very much.
Tobias Macey
0:58:10
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.