Keeping Your Data Warehouse In Order - Episode 102

Summary

Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and table definitions, including unit testing SQL scripts, defining repeatable pipelines, and adding metadata to your warehouse to improve your team’s communication. In this episode CTO and co-founder of Dataform Lewis Hemens joins the show to explain his motivation for creating the platform and company, how it works under the covers, and how you can start using it today to get your data warehouse under control.

Join us at the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View, hosted by Alluxio! This one day community conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, and TensorFlow, and you will also hear from creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! Use discount code PODCAST for 25% off tickets. Admission also includes a free training session on getting started with Presto and Alluxio in AWS, run by the creators of Presto and Alluxio. Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro!

Data Engineering Podcast listeners get 25% off with discount code PODCAST. Register here!


Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at linode.com/dataengineeringpodcast or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more.
  • Are you working on data, analytics, or AI using platforms such as Presto, Spark, or TensorFlow? Check out the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View. This one day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, and TensorFlow, and you will also hear from creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! Use discount code PODCAST for 25% off of your ticket, and the first five people to register get free tickets! Register now as early bird tickets are ending this week! Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Lewis Hemens about Dataform, a platform that helps analysts manage all data processes in your cloud data warehouse

Interview

  • Introduction

  • How did you get involved in the area of data management?

  • Can you start by explaining what Dataform is and the origin story for the platform and company?

    • What are the main benefits of using a tool like Dataform and who are the primary users?
  • Can you talk through the workflow for someone using Dataform and highlight the main features that it provides?

  • What are some of the challenges and mistakes that are common among engineers and analysts with regard to versioning and evolving schemas and the accompanying data?

  • How do CI/CD and change management manifest in the context of data warehouse management?

  • How is the Dataform SDK itself implemented and how has it evolved since you first began working on it?

    • Can you differentiate the capabilities between the open source CLI and the hosted web platform, and when you might need to use one over the other?
  • What was your selection process for an embedded runtime and how did you decide on JavaScript?

    • Can you talk through some of the use cases that having an embedded runtime enables?
    • What are the limitations of SQL when working in a collaborative environment?
  • Which database engines do you support and how do you reduce the maintenance burden for supporting different dialects and capabilities?

  • What is involved in adding support for a new backend?

  • When is Dataform the wrong choice?

  • What do you have planned for the future of Dataform?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or you want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral. They provide an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any of their own infrastructure. Datacoral's customers report that their data engineers are able to spend 80% of their work time invested in data transformations rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. Are you working on data, analytics, or AI using platforms such as Presto, Spark, or TensorFlow? Check out the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View, California. This one day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leverage technologies such as Alluxio, Presto, Spark, and TensorFlow, and you will also hear from creators of open source projects including Alluxio, Presto, Airflow, and Iceberg, many of whom you've heard on this show. Use discount code PODCAST for 25% off of your ticket, and the first five people to register get free tickets. Register now, as early bird tickets are ending this week. Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in New York City. Go to dataengineeringpodcast.com/conferences today to learn more about these and other events, and to take advantage of our partner discounts to save money when you register. Your host is Tobias Macey, and today I'm interviewing Lewis Hemens about Dataform, a platform that helps analysts manage all data processes in your cloud data warehouse. So Lewis, can you start by introducing yourself?
Lewis Hemens
0:03:24
Hi, I'm Lewis. I'm CTO and co-founder of Dataform. I previously worked at Google as a software engineer; everything we did was about data. I was there for about six and a half years and left two years ago to start a company. We actually worked on a few different ideas before we came to Dataform. We took part in Y Combinator back in 2018, we got funding for Dataform last year, and we're now a team of seven or so. So we didn't actually start with data management. When I left Google, we actually worked on some different products in the data space around automated data mining. We quickly realized that one of the biggest challenges our users were facing was around data management. Of course, we couldn't do any meaningful data mining if the input to that process was not clean, well-prepared data. So this is kind of how we ended up where we are. At Google, we were very spoiled with many great tools to spin up new pipelines, do data modeling, and build scalable SQL data processes, and we felt that this was a far more interesting problem to solve, so we made the move. And do you remember how you first got involved in the area of data management? Yes, so when we were working on our first idea after leaving Google, we found that the first thing we would do with any customer we worked with was to go and write some SQL for them. We would start by modeling their data: we would pull a number of tables together, do joins, do data cleaning, and then we ended up stuck because we needed a way to give this back to them so that they could deploy it, so to speak. And this is something that most companies didn't really have set up, or particularly the startups we were working with didn't really have any way to do this. So we came across dbt, we saw what they were working on, and we loved it. We thought that one of the biggest challenges to picking this kind of solution up was to package it in a way that it could be used by maybe data analysts who are less comfortable with command line interfaces, writing code, git, and so on. So actually, our initial version of the product was a web-based IDE for dbt projects, and we'd spoken with Tristan and Drew from Fishtown Analytics while we developed this. A few months down the line, we decided to move off of dbt onto our own framework, which we can get into a little bit. But that's kind of the origin story of Dataform.
Tobias Macey
0:05:55
I'm curious what the motivation was for starting your own command line framework to replace dbt, and what the sort of main differences are in terms of capabilities or workflow that you are building into your tool, and how it compares to the overall workflow of dbt for people who might be familiar with that?
Unknown
0:06:15
Yeah, absolutely. So the way we think about it, Dataform is two things. There is a data modeling framework, which is comparable to dbt. But I think the bigger part of what we've been developing, and where we think most of the value comes from, is that we have a web-based development environment for Dataform projects, where you can develop SQL queries, validate and preview them, test new models and changes, and keep everything tracked in a git repository. This was what our initial IDE was on top of dbt. The migration away from the dbt framework was not for any particularly major reason, and, if I'm honest, our framework is actually very similar to dbt conceptually. It just allowed us to move a little quicker and do a few things that we were struggling to do. So Dataform is designed for analysts and engineers, and this web-based environment kind of gives us the best of both of those worlds. It allows teams to adopt those engineering best practices, while maybe not having everybody on the team super comfortable with doing those things manually via the command line or via code.
Tobias Macey
0:07:29
And I also noticed that the command line agent that you wrote is largely written in TypeScript, whereas dbt is primarily Python. I'm curious if the motivating factor for that is because you're using a decent amount of that logic in the web environment as well, to reduce the overall burden of development.
Unknown
0:07:50
Yes, so that is definitely a factor. There were a few factors in us going that way. Personally, I'm a big fan of typed languages; that was something that drew us towards something like TypeScript. You're absolutely right: being able to develop the framework and the back end and the front end all in the same language, using the same tooling, is a huge benefit to us and helps us move quite quickly, given that I think we knew from day one that the most valuable part of this was really packaging it up into a web-based environment.
Tobias Macey
0:08:25
So can you talk through a bit of the workflow and lifecycle of an analytics project that is using Dataform, and how it allows for collaboration between data engineers and data analysts, and who the sort of primary users are at each of the different stages?
Unknown
0:08:43
Yep. So we see this shift, and I think dbt has definitely driven this, and engineers have been doing this for a long time, but having everything version controlled and tested is extremely valuable. That is where a lot of the core value from the software engineering best practices comes from. So, looking at a Dataform project in the web UI, you typically start off by coming into a project and wanting to make a change. That might be publishing a new data set, or a table or view, or some other operation. You create a new branch, which can be done directly from the UI, and then you can start writing SQL in what looks like your standard code text editor. The query that you write will be compiled and validated as you write it, so we'll actually test your query against your warehouse. And we provide a number of built-in functions for building dependencies between different actions in your larger project. As you develop, you can test your query and preview the results. We allow you to run portions of your pipeline in isolated development schemas so that you're not overwriting production as you develop new changes. Then there are a few things you can do on top of that. One of the core features, which seems to be a big selling point for data modeling frameworks in general, I think, is the ability to write tests and assertions about your data, to make sure that the query you've written actually does what you expect it to do. Maybe previously you would have just run the query, had a look at the results, and checked, but if you actually want to put in a test that will fail in the future if somebody comes along and breaks it, then you can write an assertion. You can write documentation for data sets, to describe the table as a whole or give column-level descriptions. Or you can override settings like the schema or the name of the data set, set tags, and set up schedules. Finally, when you're finished, you can take all of these changes that you've made, commit them directly from the web UI, or from the command line interface if you're doing it manually, push them to production and to your GitHub repo, and Dataform will pull those lists of changes, and those will be deployed to production.
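To make that concrete, here is a rough sketch of what a single SQLX dataset definition along those lines might look like. The table and column names are hypothetical, and the config options shown (type, schema, description, tags) and the ${ref(...)} dependency helper are used as illustrations of Dataform's SQLX format rather than an authoritative example.

    config {
      type: "table",
      schema: "analytics",
      description: "Daily order totals per customer",
      tags: ["daily"]
    }

    SELECT
      customer_id,
      DATE(ordered_at) AS order_date,
      SUM(amount) AS total_amount
    FROM ${ref("orders")} -- declares a dependency on the hypothetical 'orders' dataset
    GROUP BY 1, 2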
Tobias Macey
0:11:06
And you mentioned that a decent amount of the utility of a tool such as Dataform is that it helps to encourage and enforce best practices such as versioning and testing. One of the things that I found interesting as I was going through the documentation is that the form of the tests is actually that you build a query that is intended to not return any results. So essentially, the inverse of the intended query is how you actually generate the test, and if there are any results returned, that is when it fails. So I'm just curious if you can talk through some of the challenges, or how users have found that in terms of being able to understand conceptually how to build effective tests, and how the continuous integration and continuous delivery workflow manifests in the context of data warehouse management using tools like Dataform.
Unknown
0:12:04
Yes, so the concept of an assertion is a bit jarring at first. In practice, it's simply a lot easier to write these kinds of checks to look for broken rows in your data than to look for the presence of something. This is a concept dbt had, and it was also a concept we had in similar platforms at Google, so I think that learning curve can be got over quite quickly. When it comes to CI/CD, I guess we should break these out. Continuous integration is part of that process that we talked about when you're making a change to your data pipeline. I don't think we always think about it as software, but it is software, and you want to test that it actually works and make sure that those assertions pass. Doing CI in data environments can be tricky because data pipelines can take a very long time to run; it could be several hours before you get a result back for a particularly large project. So that makes CI tricky, and we can apply maybe some clever things to try and speed that up, for example, running your entire pipeline in a staging environment where only 1% of the data is processed, or where we don't materialize any of the results and simply create views. So there are things we can do there. When you actually build your Dataform project as well, there is a validation step going on as you write SQL and develop the project, so we will check for obvious errors such as circular dependencies, which we can catch before you commit and push those changes to production. This all kind of helps with CI in general. Continuous delivery is a lot easier in a data environment: we simply deploy these things to production when we're ready. And yeah, that's pretty much it.
Tobias Macey
0:14:00
In terms of the speed aspect of that, I'm sure that having the tests structured in a way that they're intended to not match anything will help to speed up the overall execution. Whereas if you're trying to, as you said, assert the presence of certain records or certain data types, it means that you're actually going to be doing a lot more processing in the database of the records, to be able to return something potentially quite large in the results, and then having to do validation against that.
Unknown
0:14:29
So I think that's probably not true for most warehouses, though; it's going to depend on the warehouse. When we actually execute an assertion, we take the SELECT query that you've given us and we wrap it up in a SELECT SUM(1) FROM your user-generated code. So actually, whether rows exist or not, because the data is never being put anywhere, the execution cost is the same. One of the goals that we had in designing the system is that we don't want data to ever leave your warehouse, and I think that's something important to understand about Dataform: it's not running or processing your data itself. It's an orchestrator; it executes those queries for you. So when you write an assertion, we actually create a view in your data warehouse which contains the query that you've provided us with, and then we query that. One of the benefits of this is that if your assertion fails in production and you want to know why, you can go to that view and have a look, query it yourself, and try and work out what's going on without data actually leaking from your production data warehouse.
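As a concrete illustration of that pattern, here is a minimal sketch of an assertion in SQLX. The dataset and column names are hypothetical; the query selects only "broken" rows, so it should return nothing when the data is healthy, and Dataform then materializes it as a view it can count against, as described above.

    config { type: "assertion" }

    -- Any row returned here represents bad data: the assertion fails if this
    -- query returns one or more rows.
    SELECT *
    FROM ${ref("daily_customer_orders")}
    WHERE customer_id IS NULL
       OR total_amount < 0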
Tobias Macey
0:15:40
And then in terms of the engineering best practices that you're trying to enforce beyond just testing, I'm curious what you have found to be common challenges and mistakes among engineers and analysts in terms of versioning and evolving schemas and the accompanying data, as well as just general collaboration best practices to ensure that all the engineers and analysts are on the same page and able to work effectively.
Unknown
0:16:08
Yeah. So we see a number of challenges and other issues come up often, typically when people have moved from a non-data-warehousing world, maybe working with a more traditional database like MySQL or SQL Server. I think a lot has changed about how to think about this stuff. This was definitely something that we were very used to at Google: source data should generally be immutable. When you're reading data from logs, transaction logs, or event logs from your website, that data should ideally never change. And similarly, when you're reading data from a database, you ideally want to read it in a way that you have a full history of everything that has changed. So when we come to transforming the data and producing derived data sets, which is what you're primarily doing with a product like Dataform, you want to be able to build those derived data sets from scratch. If you lost them all, if you accidentally deleted all of those data sets, you want to be able to recreate that entire warehouse state again. We often see that the first thing people do when they start using our product is actually write something more like SQL scripts which perform some sequence of operations: creating tables, inserting into those tables, deleting from those tables. This is kind of stateful, and it becomes very hard to reason about. I believe Drew mentioned this on the dbt episode of this podcast: ideally you want to be a bit more declarative, and every data set should ideally be a SELECT. There are some situations where this doesn't work, GDPR probably being the main issue there. But if you can get to having your data sets declared in that way, and to not mutating state or inserting and deleting rows, it is easier to reason about them and to make sure that your entire data pipeline is reproducible. Coming on to some other things that we see: I think SQL adoption is growing, and I think the power of modern warehouses is enabling that. We definitely see some people still falling back to processing data with Python, NumPy, or R. There's a problem with scalability here: that is simply never going to be as fast as doing a comparable thing in SQL, as they're usually limited to a single machine or some RAM limitations, whereas you can process terabyte data sets with BigQuery or Snowflake. So that's a habit that I think we often see. I know myself, as a software engineer, I want to fall back to what I know and what I understand, but moving this stuff into SQL, using UDFs where you can, actually opens up a lot of possibilities. When it comes to schema evolution, I think a lot of these challenges go away if you get these first things right. Having immutable source data sets, having reproducible transformations of those data sets, and having all of your code in one place makes it much easier to make massive changes across those systems. Because you remove the state, it's much easier to change a data set and all its dependencies in a single pass, and schema evolution becomes a bit less of an issue. At some point, scale does become an issue, and you can start looking at approaching those schema changes in a backwards compatible way. This is similar to how you might approach the problem in software engineering. For example, we have many different microservices, they might get released on different schedules, and you want to make sure that they can continue to talk to each other as that software rolls out.
So for example, removing a field from a table because it's no longer valid: you typically do that in a software engineering way by marking that field as deprecated, giving those consumers some amount of time to clean up their, in this case, SQL queries that use that data, and then, when you know all consumers are no longer reading it, you can remove it safely. Tools like Dataform make this kind of action a bit more feasible to accomplish, because you have that SQL in one place, you can actually analyze and work out who's reading this field and who you need to notify, and you know when all references to it have been removed.
Tobias Macey
0:20:29
And digging into Dataform itself more, can you talk through a bit how it's implemented and some of the evolution that it's gone through since you first began working on it as a replacement for dbt?
Unknown
0:20:44
At a high level, the Dataform framework takes a Dataform project folder containing SQLX files, which is a format that we've created, as well as JavaScript files and custom user code, and outputs a big JSON object. That JSON object describes every SQL action, its dependencies, and other metadata. So unit tests, assertions, column descriptions, tags, those are all contained in this one data object, which is no longer user code; it is declarative. We actually break the Dataform process down into a few different stages. The first stage is compilation, where we produce that first JSON graph. The second step we call build, where we take that graph and combine it with information from the data warehouse, which would be state such as what tables currently exist, whether they are tables or views, and what columns are in those tables, and we use that to generate an execution graph, which is a kind of final, ready-to-run list of every single SQL operation that we're going to actually execute as part of the graph. So this is the difference between those two graphs: one contains SELECT statements, and the second one actually contains the CREATE, DELETE, and ALTER TABLE statements, the boilerplate that tools like Dataform avoid you having to write and think about as a user. Finally, when you have that graph, Dataform can execute it and will execute nodes in parallel. If anything fails, it will run your assertions and tests and finally tell you what's happened. Via the command line this is all pretty-printed out, and via the UI we keep track of those logs and send you emails and notifications and all that stuff when something goes wrong. In terms of the evolution, as I mentioned, we were of course inspired very much by dbt. Some of our design goals were primarily to make it fast. In order to make the IDE powerful and usable, we wanted real-time compilation, so we wanted to make sure that Dataform projects could compile within a second, so that you could actually see what SQL you're generating before it got validated or run. That's something that we try to do, and possibly one of the reasons why we moved towards JavaScript. We also wanted to make sure that the output of that compilation process was machine readable and as close to pure SQL as possible. So the output of the Dataform compilation process actually is just a JSON object containing really nothing but actual SQL statements that can be run directly, either through Dataform or manually. There are a few step changes, I'd say, in the framework itself. One particularly interesting and important part was sandboxing: when Dataform projects compile, we don't allow external calls to third-party services, and they can't look at the data warehouse. This is necessary if you want to make it fast; if we allowed user code to make external API calls, this could slow things down quite a lot. Beyond that, I'd say two big changes: we've added support for unit tests for SQL queries, which is, again, one of the software engineering best practices that just hasn't really existed in the SQL world, and support for documenting data sets, which allows you to annotate fields and columns within your queries with descriptions or other metadata.
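The episode does not spell out the exact shape of that compiled object, so the snippet below is only a hypothetical sketch of the idea: a declarative list of SQL actions with their dependencies and metadata, which the later build step can turn into concrete CREATE or INSERT statements. None of the field names should be read as Dataform's actual output schema.

    // Hypothetical sketch of one entry in a compiled graph; field names are
    // illustrative, not Dataform's real compilation output.
    const compiledAction = {
      name: "analytics.daily_customer_orders",
      type: "table",
      dependencies: ["sources.orders"],
      // Pure SQL only: the SELECT the user wrote, with ${ref(...)} resolved.
      query:
        "SELECT customer_id, DATE(ordered_at) AS order_date, SUM(amount) AS total_amount " +
        "FROM sources.orders GROUP BY 1, 2",
      tags: ["daily"],
      description: "Daily order totals per customer"
    };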
Tobias Macey
0:24:35
So it sounds like you've actually ended up building a SQL parser into the command line agent for being able to extract the data types and the dependency graph. I'm curious what are some of the challenges that you encountered in the process of that, and if there were any pre-built libraries that you were able to lean on to handle some of that overall capability.
Unknown
0:25:02
So we don't currently parse SQL itself; one of the challenges there is the different dialects of SQL across different warehouse types. However, we do still have our own custom parser for the SQLX files. A SQLX file is real SQL, but it allows that SQL to be annotated with JavaScript or JSON-like blocks, which set configuration settings or allow the use of templates within that SQL, so you can reuse code and make your project more modular. We did use some libraries, I think, but we're trying to keep the footprint of the Dataform core library as small as possible. One of our engineers is experienced in this stuff and managed to whip something up in a few days, but it's certainly been challenging, and as I mentioned before, the performance is very important, and keeping that lexing and parsing fast, under a second, is something we've worked quite hard to do.
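To illustrate what those annotations look like, here is a small hypothetical SQLX sketch combining a config block, a js block, and inline ${} templating. The names, the 28-day constant, and the BigQuery-style date function are made up purely for illustration; the point is that the file stays plain SQL apart from the annotated blocks.

    config { type: "view" }

    js {
      // Plain JavaScript evaluated at compile time; this constant is a
      // made-up example of a value you might template into the query.
      const lookbackDays = 28;
    }

    SELECT
      user_id,
      COUNT(*) AS event_count
    FROM ${ref("events")}
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL ${lookbackDays} DAY)
    GROUP BY 1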
Tobias Macey
0:26:06
And you mentioned too that in the SQLX files you can embed some JavaScript logic for being able to build out macros or conditional templates. I'm curious if you could talk through some of the use cases that that enables which would otherwise be fairly difficult if you were just relying on pure SQL, and some of the limitations of SQL itself when you're working in an environment where you're collaborating with other engineers and you might need to be able to compile multiple different fragments of SQL into an overarching query or table definition.
Unknown
0:26:44
Yeah, so SQL is not a programming language, exactly. Some variants like T-SQL maybe are, but we can't really make that assumption. So, as you mentioned, things like conditionals are hard to recreate, and that really limits what you can do. To talk through an example, unit testing for a SQL query: in order to be able to unit test a SQL query, you need to have an understanding of its inputs, and you need to be able to declare its expected outputs. This isn't something that SQL can do on its own, and this is where I think a product like Dataform actually enables these use cases. Similarly, parameterization of queries for different environments. I mentioned this earlier, running staging pipelines on 1% of your data: you can't do this with SQL directly, it doesn't understand what environment it's in, but with a framework like Dataform we can wrap up that information and make it accessible to the user code, so as we generate the SQL code, we can put those variables in. The reusability aspect is obviously important: don't repeat yourself. If you are writing some actual code and you're copy-pasting more than a handful of lines, you're probably going to think twice about that and package it up. SQL, again, doesn't really have many ways to do this. Some data warehouses do support things like stored SQL procedures, which can help, but these things are rarely version controlled. So being able to do that inside a framework like Dataform massively reduces the amount of duplicate code and makes large refactorings easier. I can give you an example of one thing which one of our first customers did early on, which I think really highlights the power of this kind of thing. I mentioned earlier that GDPR is annoying but important, because you need to remove data from your warehouse; you need to do something which is kind of stateful in a sense, execute a delete operation. To forget everything about a user, you might have 30-plus tables which contain user PII or identifiable information, and you're going to have to go into those tables and execute a delete statement removing any rows which match user IDs from some deletion request table. With Dataform, this template can be packaged up with a for loop, a list of the tables, and possibly the user ID field names, and you can automatically generate those 30 queries that will actually go and run at the end of your daily schedule to remove all of those users' rows from those tables. None of this stuff would be doable directly with SQL, and none of the other things we've talked about, CI/CD, unit testing, code modularity; these are all limitations of SQL itself.
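Here is a rough sketch of how that loop-over-tables pattern might look in a Dataform JavaScript definitions file. The table list, column names, schema, and the deletion_requests table are all hypothetical, and operate() is used on the assumption that it registers a custom SQL operation in the project; the exact builder methods should be checked against Dataform's documentation.

    // definitions/gdpr_deletions.js (hypothetical file)
    // Generate one DELETE operation per table that holds user data.
    const tablesWithUserData = [
      { table: "orders", userIdColumn: "customer_id" },
      { table: "page_views", userIdColumn: "user_id" },
      // ...roughly 30 such entries in the scenario described above
    ];

    tablesWithUserData.forEach(({ table, userIdColumn }) => {
      operate(`gdpr_delete_${table}`).queries(`
        DELETE FROM analytics.${table}
        WHERE ${userIdColumn} IN (SELECT user_id FROM analytics.deletion_requests)
      `);
    });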
Tobias Macey
0:29:59
And then another thing that you mentioned earlier is the different dialects for the different data warehouses. I'm curious how you have approached encapsulating the common core capabilities, while also allowing for taking advantage of the differentiating factors between those different warehouses, and what's involved in adding support for a new back end.
Unknown
0:30:27
Yep. So currently we have support for BigQuery, Redshift, Snowflake, Postgres, and Azure SQL Data Warehouse. It certainly is a cost to us to maintain all of these. Thankfully, they do share a lot of similarities, and if you look at Redshift and Postgres particularly, they are very similar in terms of how they work. Snowflake is fairly similar too. BigQuery is probably the odd one out here, in that it has its own way of doing things, and it definitely causes us a bit of a headache. One of our decisions in how we interface with these warehouses is to treat them in the most simple way we possibly can, which is that they are something that we send SQL strings to and get responses back from. Internally, that is pretty much what our database adapter looks like. If you want to support a new warehouse, you need to tell it a little bit about the SQL dialect, and most of these derive from the same kind of core standard SQL dialects that we have inside our repository, and you need to tell it how to connect to that warehouse and how to execute statements. And that's really it; that's the fundamental piece. It's quite low level, it doesn't know much more. We do have some APIs that need to be implemented, such as listing tables, validating statements (if you want validation in the Dataform web UI), listing schemas, and fetching table metadata, such as the schema of a table. There are some headaches there, dealing with different data types in different databases, and some often tricky issues to debug around connection pooling, but thanks to keeping that interface quite simple, this hasn't been a huge burden for us so far. I suspect some of the other warehouses, such as Athena and Presto, will make this a little trickier; when you're dealing with data on S3, for example, you maybe lose this concept of a schema.table relation identifier.
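To make the shape of that adapter concrete, here is a hypothetical JavaScript sketch, not Dataform's actual interface, of a minimal Postgres-flavoured adapter that only knows how to execute SQL strings plus the handful of metadata calls mentioned above. It assumes the node-postgres (pg) library for the connection pool, and skips input escaping for brevity.

    // Hypothetical sketch only; not Dataform's real adapter interface.
    const { Pool } = require("pg"); // assumes the node-postgres library

    class SimpleWarehouseAdapter {
      constructor(connectionConfig) {
        this.pool = new Pool(connectionConfig);
      }

      // The core of the adapter: send a SQL string, get rows back.
      async execute(sqlStatement) {
        const result = await this.pool.query(sqlStatement);
        return result.rows;
      }

      // Metadata calls like those described above, built on top of execute().
      async schemas() {
        return this.execute("SELECT schema_name FROM information_schema.schemata");
      }

      async tables() {
        return this.execute(
          "SELECT table_schema, table_name FROM information_schema.tables"
        );
      }

      async tableMetadata(schema, table) {
        // No escaping here; a real adapter would use parameterized queries.
        return this.execute(
          `SELECT column_name, data_type FROM information_schema.columns
           WHERE table_schema = '${schema}' AND table_name = '${table}'`
        );
      }
    }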
Tobias Macey
0:32:44
And in terms of the SQL dialect support, do you have any validation in the compilation step to verify that a particular statement is going to comply with the dialect of the engine that it's going to be running against?
Unknown
0:33:01
Yes, so we've thought a little bit about this; there are some libraries you can use to do this. At the moment, we don't do that automatically at compilation, which, as I mentioned, we want to keep really fast. We will do validation in the UI after we successfully compile, and we can catch some of that stuff where it makes sense, but we generally rely on asking the warehouse itself to validate a query for us. This is particularly important for something like BigQuery where, you know, they've perhaps moved a bit away from your standard dialect, and the source of truth is ultimately asking the BigQuery API itself whether this is a valid query. One of the things that makes this tricky is understanding whether the tables that you're reading from even exist, and it's useful for us to be able to surface that kind of information.
Tobias Macey
0:33:53
And then, as you've mentioned, there is the web UI and there's also the command line interface that has the core functionality. I'm wondering if you can talk a bit about the differences in capabilities and when somebody might want to reach for the web UI versus the command line.
Unknown
0:34:11
So the Dataform web platform is really geared towards enabling collaboration across an entire team. Dataform is an IDE for Dataform projects: you can develop new models, develop SQL queries, get real-time compilation, and preview results. It integrates directly with git, so you can create branches, make commits, and develop and test changes in an isolated environment, so you're not overwriting your production data every time somebody makes a change. It also wraps up some of the deployment pieces: continuous delivery, scheduling, logging, email notifications, all this kind of stuff you'd expect from a scheduler-type system. I think the command line interface is still useful for a number of things, because every project in Dataform is usually backed by a GitHub or GitLab repository, and it's kind of up to individuals how much they want to use the IDE or not. We see teams where maybe the software engineers or the data engineers are checking out that code themselves, making edits in their preferred IDE, and using the command line interface to test those changes, whereas perhaps analysts are less comfortable doing that and would rather work in the web IDE itself, where they get a bit more real-time feedback as they develop queries. So I think you can use both together. But the parts of the web platform which we think are particularly useful are the collaborative pieces, and that's probably where we've spent the majority of our time, to make sure that analysts have a place to come and contribute to projects without having to do things that they may be less comfortable with.
Tobias Macey
0:36:05
And what are the cases where Dataform is the wrong choice for a given project?
Unknown
0:36:11
Yes. So I guess we talked a bit about how Dataform compiles projects. The only time user code gets executed in a Dataform project is during compilation and the generation of SQL queries, which means at runtime we can't actually execute user code. So you can't pull from a third-party API, for example; for something like that, you still need to fall back to a system like Airflow, where you can actually run user code at runtime. I'd say the other situation where Dataform really doesn't make sense is real-time streaming. I think there are a few groups working on making streaming SQL dialects a real thing, and I know these have already been developed for Apache Beam and for Kafka, and this is something that I think we want to think about one day. But right now, if you need extremely low latency, under, say, five to ten minutes, which is probably the fastest you'd feasibly want to run a batch system, then you want to look at doing something a little bit more custom. I think in practice, for the kind of use cases that we see people applying Dataform to, this isn't a problem; it's rare that you need sub-five-minute latency. There are features like incremental table builds, which allow you to do micro-batching in a sense, but for anything faster than that, real-time dashboards for example, Dataform is probably not the right choice.
Tobias Macey
0:37:50
And then in terms of your experience of building and running the Dataform project and business, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned?
Unknown
0:38:03
Yeah, so Dataform is a complex product, and it's a complex idea, and I'd say we have a significant challenge trying to communicate what it is to our users, in terms of the value proposition, but also in terms of getting them started. So this is something that we've been investing a lot in and continuing to invest in: that initial onboarding experience when you come and sign up. Perhaps originally you were just dropped in an IDE with an empty code editor and kind of expected to know what to do, and in practice that just doesn't work. So we're having to invest a lot more than I think we ever imagined in making that process smooth and clear to new users, and we also do a lot of manual hand-holding in that process. As a company, I'm not sure if there were many surprising things. Starting a startup is hard, fundraising is hard, building a team is hard, but it's also rewarding, and I think the team is thoroughly enjoying the problem that we're solving right now.
Tobias Macey
0:39:18
And one thing that I'm not sure if we mentioned explicitly is that the command line piece of Dataform is open source. So I'm wondering what your governance policy is around that, and how you're approaching the sustainability of that core framework as it relates to the business and the web application that you're building on top of it.
Unknown
0:39:40
Yeah, we think a lot about where the line to draw there is, and I think it's a tough balance to get right. Obviously, we do have a lot of features of the product which are closed source. This has happened already, where we've had people not from Dataform come in and make a change to the open source repo, and we want to make sure that they can add the features that they want, or support for their data warehouse, without us having to make a number of changes in our proprietary systems too. So that gives flexibility for people to come and change the very core part of the product, while we can still maintain control of the web platform, and they can still use that with the new features that they developed.
Tobias Macey
0:40:29
And looking forward, what do you have planned for the near to medium term of Dataform, both on the technical and business side, as far as new capabilities or improvements?
Unknown
0:40:42
Yes, so our users spend a lot of time in our app, as it is the primary place for development of queries and SQL, so we're doubling down on that. In terms of the features we have, we're really supporting that core use case right now, and I'd say over the last six to nine months we've managed to get out most of the key blocking features for new users. Right now we want to make sure that developing SQL and developing Dataform projects in the web UI is a really pleasant experience, that it's fast and snappy, and that we're getting rid of those edge case bugs, of which there is a never-ending stream. Beyond that, I'd say supporting massive data teams. This is really challenging. We have some big clients using our software with 200-300 data models, but we are also speaking to clients with many, many more. When you start looking at these processes in teams of analysts which may be 100 or more, managing thousands of SQL queries and extremely complex data pipelines, I think there are a lot of challenges that still need to be solved there, so we're trying to think a lot about how we deal with that. Some of these are around more enterprise things: fine-grained access control, data ownership, data governance, column-level lineage. This is all extremely important at that level. In terms of medium-term features, we're currently building out support for release processes and multiple deployment environments, so you can have staging and canary environments. We're looking at first-class support for UDFs; I mentioned UDFs before, and I think this enables a lot of really interesting use cases in modern data warehouses, as well as support for Presto and Athena, and integration with some other data workflow tools such as Airflow. In terms of the maybe slightly longer term, I think we'd like to be able to support multiple data warehouses with a single Dataform project. I mentioned the streaming thing too; that's something that's always on the back of our mind. And we're also investing in a number of packages to make common SQL processes a little easier to do without having to reinvent the wheel every time. So for example, analyzing A/B experiments, managing access control, or utilities to simplify writing data tests. We think we can save a lot of people a lot of repeat work by packaging some of these up and making them available to the community.
Tobias Macey
0:43:24
Yeah, having the pre-packaged data tests, I'm sure, would be quite interesting and useful, especially for cases where you're providing validation for some common data sources, where somebody might be pulling from Google Analytics, or maybe Twitter or LinkedIn, for being able to ensure that there's an appropriate schema, or that you have some sort of common needs that are able to be encapsulated in those pre-baked tests.
Unknown
0:43:51
Yeah, so a good example of this is Segment. I think a lot of our users use Segment, and they often have the same set of data models that they are developing to build on top of the core Segment data sets, like sessionized tracks and page views. This is quite hairy code, and if we can provide a kind of ready-to-go, tested package so that others can benefit from it, then I think that really builds up the ecosystem and the framework, and gives you a lot more value without having to write a lot of SQL yourself.
Tobias Macey
0:44:27
Are there any other aspects of Dataform and data warehouse management in the context of data engineering and analytics that we didn't discuss yet that you'd like to cover before we close out the show?
Lewis Hemens
0:44:38
No, I think that's great. There's nothing else.
Tobias Macey
0:44:41
Well, for anybody who wants to get in touch with you or follow along with the work that you're doing. I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Unknown
0:44:56
So I'd say there's still a gap there. I think we're bringing a lot of best practices and deployment processes to SQL for those core data models, but one of the areas where a lot of what we're doing with Dataform doesn't apply is machine learning, where you still require slightly more custom approaches to building and deploying models. We've definitely spoken to a lot of people who are having to invest heavily in how they develop and deploy machine learning models in a kind of reproducible way, again following the software engineering best practices. So this is definitely an area that I expect to see a lot of innovation in over the next few years.
Tobias Macey
0:45:41
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Dataform. It's definitely interesting to see that there's more innovation going on in this space of ensuring repeatable processes and best practices for engineering in the context of data warehouses, and leaning on SQL capabilities, so I'm excited to see where you take the business and where the overall space ends up. So thank you for all your efforts on that, and I hope you enjoy the rest of your day.
Lewis Hemens
0:46:11
Great. Thank you very much.
Tobias Macey
0:46:17
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.