Digging Into Data Replication At Fivetran - Episode 93

Summary

The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at linode.com/dataengineeringpodcast or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and Corinium Global Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing George Fraser about Fivetran, a hosted platform for replicating your data from source to destination

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the problem that Fivetran solves and the story of how it got started?
  • Integration of multiple data sources (e.g. entity resolution)
  • How is Fivetran architected and how has the overall system design changed since you first began working on it?
  • monitoring and alerting
  • Automated schema normalization. How does it work for customized data sources?
  • Managing schema drift while avoiding data loss
  • Change data capture
  • What have you found to be the most complex or challenging data sources to work with reliably?
  • Workflow for users getting started with Fivetran
  • When is Fivetran the wrong choice for collecting and analyzing your data?
  • What have you found to be the most challenging aspects of working in the space of data integrations?
  • What have been the most interesting/unexpected/useful lessons that you have learned while building and growing Fivetran?
  • What do you have planned for the future of Fivetran?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey.
And today I'm interviewing George Fraser about Fivetran, a platform for shipping your data to data warehouses in a managed fashion. So George, can you start by introducing yourself?
0:01:54
Yeah, my name is George. I am the CEO of Fivetran, and I was one of two co-founders of Fivetran almost seven years ago when we started.
0:02:02
And do you remember how you first got involved in the area of data management?
0:02:05
Well, before Fivetran, I was actually a scientist, which is a bit of an unusual background for someone in data management, although it was sort of an advantage for us that we were coming at it fresh. So much has changed in the area of data management, particularly because of the new data warehouses that are so much faster, so much cheaper, and so much easier to manage than the previous generation, that a fresh approach is really merited. And so in a weird way, the fact that none of the founding team had a background in data management was kind of an advantage.
0:02:38
And so can you start by describing a bit about the problem that Fivetran was built to solve, the overall story of how it got started, and what motivated you to build a company around it?
0:02:50
Well, I'll start with the story of how it got started. So in late 2012, when we started the company (Taylor and I, and then Mel, who's now our VP of engineering, who joined early in 2013), Fivetran was originally a vertically integrated data analysis tool. It had a user interface that was sort of a super-powered spreadsheet slash BI tool, it had a data warehouse on the inside, and it had a data pipeline that was feeding the data warehouse. And through many iterations of that idea, we discovered that the really valuable thing we had invented was actually the data pipeline that was part of that. So we threw everything else away, and the data pipeline became the product. And the problem that Fivetran solves is the problem of getting all your company's data in one place. Companies today use all kinds of tools to manage their business. You use CRM systems like Salesforce, payment systems like Stripe, support systems like Zendesk, finance systems like QuickBooks or Zuora, and you have a production database somewhere; maybe you have 20 production databases. If you want to know what is happening in your business, the first step is usually to synchronize all of this data into a single database, where an analyst can query it, and where you can build dashboards and BI tools on top of it. So that's the primary problem that Fivetran solves. People use Fivetran to do other things; sometimes they use the data warehouse that we're syncing to as a production system. But the most common use case is they're just trying to understand what's going on in their business, and the first step in that is to sync all of that data into a single database.
0:04:38
And in recent years, one of the prevalent approaches for getting all the data into one location to do analysis across it is to dump it all into a data lake, because you don't need to do as much upfront schema management or data cleaning, and then you can experiment with everything that's available. And I'm wondering what your experience has been as far as the contrast between loading everything into a data warehouse for that purpose versus just using a data lake.
0:05:07
Yeah. So in this area, I think that sometimes people present a bit of a false choice: you can either set up a data warehouse and do full-on Kimball dimensional schema data modeling with Informatica, with all of the upsides and downsides that come with that, or you can build a data lake, which is like a bunch of JSON and CSV files in S3. And I say false choice, because I think the right approach is a happy medium, where you don't go all the way to sticking raw JSON files and CSV files in S3; that's really unnecessary. Instead, you use a proper relational data store, but you exercise restraint in how much normalization and customization you do on the way in. So you say: I'm going to make my first goal to create an accurate replica of all the systems in one database, and then I'm going to leave that alone. That's going to be my sort of staging area, kind of like my data lake, except it lives in a regular relational data warehouse. And then I'm going to build whatever transformations I want to do on that data on top of that data lake schema. So another way of thinking about it is that I am advising that you should take a data-lake-type approach, but you shouldn't make your data lake a separate physical system. Instead, your data lake should just be a different logical system within the same database that you're using to analyze all your data and to support your BI tool. It's just a higher-productivity, simpler workflow to do it that way.
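A minimal sketch of this "logical data lake" idea, using SQLite from Python purely for illustration: the untransformed replica and the analytics layer live in one database, separated only by naming. A table-name prefix stands in for a real warehouse schema, and all table and column names here are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: the pipeline's only job is an accurate replica of the source,
# kept in a "staging" area and left untouched.
conn.execute(
    "CREATE TABLE staging_salesforce_account (id TEXT, name TEXT, is_deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO staging_salesforce_account VALUES (?, ?, ?)",
    [("001", "Acme", 0), ("002", "Globex", 1)],
)

# Step 2: transformations are layered on top of the replica, never in place
# of it, so you can always re-derive them if you change your mind.
conn.execute("""
    CREATE VIEW active_accounts AS
    SELECT id, name FROM staging_salesforce_account WHERE is_deleted = 0
""")

rows = conn.execute("SELECT * FROM active_accounts").fetchall()
print(rows)  # [('001', 'Acme')]
```

Because the raw replica stays intact, a mistaken transformation costs only a `DROP VIEW` and a rewrite, not a re-sync from the source.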
0:06:47
Yeah. And that's where the current trend toward moving the transformation step until after the data loading, the ELT pattern, has been coming from, because of the flexibility of these cloud data warehouses that you've mentioned, as far as being able to consume semi-structured and unstructured data while still being able to query across it and introspect it for the purposes of joining with other information that's already within that system.
0:07:11
Yeah, the ELT pattern is really just a great way to get work done. It's simple, and it allows you to recover from mistakes. If you make a mistake in your transformations, and you will make mistakes in your transformations, or even if you just change your mind about how you want to transform the data, the great advantage of the ELT pattern is that the original untransformed data is still sitting there side by side in the same database. So it's just really easy to iterate in a way that it isn't if you're transforming the data on the fly, or even if you have a data lake where you store the API responses from all of your systems; that's still more complicated than if you just have this nice replica sitting in its own schema in your data warehouse.
0:07:58
And so one of the things that you pointed out is needing to be able to integrate across multiple different data sources that you might be using within a business. And you mentioned things like Salesforce for CRM, or things like ticket tracking and user feedback, such as Zendesk, etc. And I'm wondering what your experience has been as far as being able to map the logical entities across these different systems together, to be able to effectively join and query across those data sets, given that they don't necessarily have a shared sense of truth for things like how customers are represented, or even what the common field names might be to map across those different entities.
0:08:42
Yeah, this is a really important step, and the first thing we always advise our customers to do. Even anyone who's building a data warehouse, I would advise to do this: you need to keep straight in your mind that there are really two problems here. The first problem is replicating all of the data, and the second problem is rationalizing all the data into a single schema. You need to think of these as two steps; you need to follow proper separation of concerns, just as you would in a software engineering project. So we really focus on that first step, on replication. What we have found is that the approach that works really well for our customers for rationalizing all the data into a single schema is to use SQL. SQL is a great tool for unioning things, joining things, changing field names, filtering data, all the kinds of stuff you need to do to rationalize a bunch of different data sources into a single schema. We find the most productive way to do that is to use a bunch of SQL queries that run inside your data warehouse.
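A toy illustration of that rationalization step: two replicated source tables unified into one schema with plain SQL (a union plus renamed columns), here run through SQLite from Python. The source tables and column names are invented, not any vendor's real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Replicated source tables, as the pipeline delivered them.
conn.execute("CREATE TABLE salesforce_contact (id TEXT, full_name TEXT, email TEXT)")
conn.execute("CREATE TABLE zendesk_user (user_id TEXT, name TEXT, email_address TEXT)")
conn.execute("INSERT INTO salesforce_contact VALUES ('sf1', 'Ada Lovelace', 'ada@example.com')")
conn.execute("INSERT INTO zendesk_user VALUES ('zd1', 'Alan Turing', 'alan@example.com')")

# Rationalization: UNION ALL plus column renames yields one 'person' view
# spanning both systems, without touching the replicas themselves.
conn.execute("""
    CREATE VIEW person AS
    SELECT id AS source_id, 'salesforce' AS source,
           full_name AS name, email AS email
    FROM salesforce_contact
    UNION ALL
    SELECT user_id AS source_id, 'zendesk' AS source,
           name AS name, email_address AS email
    FROM zendesk_user
""")

people = conn.execute("SELECT source, name FROM person ORDER BY source").fetchall()
print(people)  # [('salesforce', 'Ada Lovelace'), ('zendesk', 'Alan Turing')]
```

Real entity resolution (deciding that `sf1` and `zd1` are the same customer) still takes joins on shared keys such as email, but the same SQL-in-warehouse approach applies.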
0:09:44
And do you have your own tooling and interfaces for exposing that process to your end users? Or do you also integrate with tools such as dbt, for having that overall process controlled by the end user?
0:10:00
We originally did not do anything in this area other than give advice, and we got the advantage that we got to sort of watch what our users did in that context. What we saw is that a lot of them set up cron to run SQL scripts on a regular schedule. A lot of them used Looker persistent derived tables. Some people used Airflow, and they used Airflow in kind of a funny way: they didn't really use the Python parts of Airflow, they just used Airflow as a way to trigger SQL. And when dbt came out, we got a decent community of users who use dbt, and we're supportive of whatever mechanism you want to use to transform your data. We do now have our own transformation tool built into our UI, and it's the first version that you can use right now. It's basically a way that you can provide a SQL script, and you can trigger that SQL script when Fivetran delivers new data to your tables. We've got lots of people using the first version of that. It's going to continue to evolve over the rest of this year; it's going to get a lot more sophistication, and it's going to do a lot more to give you insight into the transforms that are running and how they all relate to each other. But the core idea of it is that SQL is the right tool for transforming data.
0:11:19
And before we get too far into the rest of the feature set and capabilities of Fivetran, I'm wondering if you can talk about how the overall system is architected, and how the overall system design has evolved since you first began working on it.
0:11:33
Yeah, so the overall architecture is fairly simple. The hard part of Fivetran is really not the sort of high-class data problems, things like queues and streams and giant data sets flying around. The hard part of Fivetran is really all of the incidental complexity of all of these data sources, understanding all the small, sort of crazy rules that every API has. So most of our effort over the years has actually been devoted to hunting down all these little details of every single data source we support, and that's what makes our product really valuable. The architecture itself is fairly simple. The original architecture was essentially a bunch of EC2 instances with cron, running a bunch of Java processes on a fast batch cycle, syncing people's data. Over the last year and a half, the engineering team has built a new architecture based on Kubernetes. There are many advantages of this new architecture for us internally; the biggest one is that it auto-scales. But from the outside, you can't even tell when you migrate from the old architecture to the new architecture, other than you have to whitelist a new set of IPs. So it was a very simple architecture in the beginning, and it's gotten somewhat more complex. But really, the hard part of Fivetran is not the high-class data engineering problems. It's the little details of every data source, so that from the user's perspective, you just get this magical replica of all of your systems in a single database.
0:13:16
And for being able to keep track of the overall health of your system and ensure that data is flowing from end to end for all of your different customers. I'm curious what you're using for monitoring and alerting strategy and any sort of ongoing continuous testing, as well as advanced unit testing that you're using to make sure that all of your API interactions are consistent with what is necessary for the source systems that you're working with?
0:13:42
Yeah, well, first of all, there are several layers to that. The first one is actually the testing that we do on our end to validate that all of our sync strategies, all those little details I mentioned a minute ago, are actually working correctly. Our testing problem is quite difficult, because we interoperate with so many external systems, and in many cases, you really have to run the tests against the real system for the test to be meaningful. And so our build architecture is actually one of the more complex parts of Fivetran. We use a build tool called Bazel, and we've done a lot of work, for example, to run all of the databases and FTP servers and things like that that we have to interact with in Docker containers, so that we can actually produce reproducible end-to-end tests. That actually is one of the more complex engineering problems at Fivetran, and if that sounds interesting to you, I encourage you to apply to our engineering team, because we have lots more work to do on that. So that's the first layer: all of those tests that we run to verify that our sync strategies are correct. The second layer is, you know, is it working in production? Is the customer's data actually getting synced, and is it getting synced correctly? And one of the things we do there that may be a little unexpected to people who are accustomed to building data pipelines themselves is that all of Fivetran's data pipelines are typically fail-fast. That means if anything unexpected happens, if we see, you know, some event from an API endpoint that we don't recognize, we stop. Now, that's different than when you build data pipelines yourself. When you build data pipelines for your own company, usually you will have them try to keep going no matter what. But Fivetran is a fully managed service, and we're monitoring it all the time.
So we tend to make the opposite choice: if anything suspicious is going on, the correct thing to do is just stop and alert Fivetran: hey, go check out this customer's data pipeline, what the heck is going on? Something unexpected is happening, and we should make sure that our sync strategies are actually correct. And then that brings us to the last layer of this, which is alerting. So when data pipelines fail, we get alerted and the customer gets alerted at the same time. And then we communicate with the customer, and we say, hey, we may need to go in and check something; do I have permission to go, you know, look at what's going on in your data pipeline in order to figure out what's going wrong? Because Fivetran is a fully managed service, and that is critical to making it work. When you do what we do, and you say we are going to take responsibility for actually creating an accurate replica of all of your systems in your data warehouse, that means you're signing on to comprehend and fix every little detail of every data source that you support. And a lot of those little details only come up in production, when some customer shows up and they're using a feature of Salesforce that Salesforce hasn't sold for five years, but they've still got it, and you've never seen it before. A lot of those little things only come up in production. The nice thing is that that set of little things, while it is very large, it is finite. And we only have to discover each problem once, and then every customer thereafter benefits from that.
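The fail-fast behavior described above can be sketched in a few lines. This is an illustrative toy, not Fivetran's actual code; the event types and exception name are invented:

```python
class UnexpectedSourceData(Exception):
    """Raised to halt the sync instead of guessing at unknown input."""

# Illustrative set of event types the connector knows how to handle.
KNOWN_EVENT_TYPES = {"created", "updated", "deleted"}

def apply_event(event, replica):
    # Fail fast: an unrecognized event stops the pipeline and pages the
    # operator, rather than being silently skipped (the usual in-house choice).
    if event["type"] not in KNOWN_EVENT_TYPES:
        raise UnexpectedSourceData(f"unknown event type: {event['type']!r}")
    if event["type"] == "deleted":
        replica.pop(event["id"], None)
    else:
        replica[event["id"]] = event["data"]

replica = {}
apply_event({"type": "created", "id": 1, "data": {"name": "Acme"}}, replica)
try:
    # A surprise from the API halts the sync instead of corrupting the replica.
    apply_event({"type": "upserted", "id": 2, "data": {}}, replica)
except UnexpectedSourceData as exc:
    print("halted sync:", exc)
```

The design tradeoff is availability versus correctness: a self-managed pipeline usually prefers to limp along, while a managed service with an on-call team can afford to stop and investigate.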
0:17:00
For the system itself, one of the things that I noticed while I was reading through the documentation and the feature set is that for all of these different source systems, you provide automated schema normalization. I'm curious how that works and the overall logic flow that you have built in. Is it just a static mapping that you have for each different data source, or is there some more complex algorithm going on behind the scenes there? And also how that works for any sort of customized data sources, such as application databases that you're working with, or maybe just JSON feeds or event streams?
0:17:38
Sure. So the first thing you have to understand is that there are really two categories of data sources in terms of schema normalization. The first category is databases, like Oracle or MySQL or Postgres, and database-like systems. NetSuite is really basically a database when you look at the API, and so is Salesforce; there's a bunch of systems that basically look like databases. They have arbitrary tables and columns, and you can set any types you want in any column. What we do with those systems is we just create an exact one-to-one replica of the source schema. It's really as simple as that. There's a lot of work to do, because the change feeds from those systems are usually very complicated, and it's very complex to turn those change feeds back into the original schema, but it is automated. So for databases and database-like systems, we just produce the exact same schema in your data warehouse as it was in the source. For apps, things like Stripe or Zendesk or GitHub or Jira, we do a lot of normalization of the data. With tools like that, when you look at the API responses, the API responses are very complex and nested, and usually very far from the original normalized schema that this data probably lived in in the source database. Every time we add a new data source of that type, we study the data source. I joke that we reverse engineer the API: we basically figure out what the schema of the database that this originally was, and we unwind all the API responses back into the normalized schema. These days, we often just get an engineer at the company that is that data source on the phone and ask them, you know, what is the real schema here? We found that we can save ourselves a whole lot of work by doing that. But the goal is always to produce a normalized schema in the data warehouse.
And the reason why we do that is because we just think, if we put in that work up front to normalize the data in your data warehouse, we can save every single one of our customers a whole bunch of time, traipsing through the data, trying to figure out how to normalize that. So we figure it's worthwhile for us to put the effort in up front so our customers don't have to.
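The "unwinding" of a nested API response back into a normalized schema can be illustrated with a toy example. The response shape and the table names below are invented, loosely in the style of a billing API, and are not any vendor's real payload:

```python
# Hypothetical nested API response: one invoice embedding its customer
# and its line items, as an app API might return it.
response = {
    "id": "inv_1",
    "customer": {"id": "cus_9", "name": "Acme"},
    "lines": [
        {"id": "li_1", "amount": 500},
        {"id": "li_2", "amount": 250},
    ],
}

def unwind(resp):
    """Unwind one nested response into rows for normalized tables."""
    # Each nested object becomes a row in its own table, linked by keys,
    # approximating the schema the data originally lived in at the source.
    return {
        "customer": [{"id": resp["customer"]["id"],
                      "name": resp["customer"]["name"]}],
        "invoice": [{"id": resp["id"],
                     "customer_id": resp["customer"]["id"]}],
        "invoice_line": [{"id": line["id"],
                          "invoice_id": resp["id"],
                          "amount": line["amount"]}
                         for line in resp["lines"]],
    }

tables = unwind(response)
print(sorted(tables))  # ['customer', 'invoice', 'invoice_line']
```

The payoff of doing this once, upstream, is that every customer queries clean relational tables instead of re-deriving the structure from raw JSON themselves.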
0:20:00
One of the other issues that comes up with normalization, and particularly for the source database systems that you're talking about, is the idea of schema drift, when new fields are added or removed, or data types change, or the default data types change. And I'm wondering how you manage schema drift overall in the data warehouse systems that you're loading into, while preventing data loss, particularly in the cases where a column might be dropped or the data type changed.
0:20:29
Yeah, so there's a core pipeline that all Fivetran connectors, databases, apps, everything, are written against, that we use internally, and all of the rules of how to deal with schema drift are encoded there. Some cases are easy. If you drop a column, then that data just isn't arriving anymore; we will leave that column in your data warehouse, and we're not going to delete it in case there's something important in it. You can drop it in your data warehouse if you want to; we're not going to. If you add a column, again, that's pretty easy: we add a column in your data warehouse, all of the old rows will have nulls in that column, obviously, but then going forward, we will populate that column. The tricky cases are when you change the types. When you alter the type of an existing column, that can be more difficult to deal with. There are two principles we follow. First of all, we're going to propagate that type change to your data warehouse. So we're going to go and change the type of the column in your data warehouse to fit the new data. And the second principle we follow is that when you change types, sometimes you sort of contradict yourself, and we follow the rules of subtyping in handling that. If you think back to your undergraduate computer science classes, this is the good old concept of subtypes: for example, an int is a subtype of a real, and a real is a subtype of a string. So we look at all the data passing through the system, and we infer what is the most specific type that can contain all of the values that we have seen. And then we alter the data warehouse to be that type, so that we can actually fit the data into the data warehouse.
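The subtyping rule described here (an int is a subtype of a real, a real is a subtype of a string) can be sketched as a small type-inference routine. This is a minimal illustration of the idea, not Fivetran's implementation:

```python
# Widening order: INT <: REAL <: STRING. The inferred column type is the
# least upper bound of every value type observed in the stream.
ORDER = ["INT", "REAL", "STRING"]

def value_type(value):
    # bool is excluded because isinstance(True, int) is True in Python.
    if isinstance(value, int) and not isinstance(value, bool):
        return "INT"
    if isinstance(value, float):
        return "REAL"
    return "STRING"

def most_specific_common_type(values):
    """Return the narrowest type that can hold every observed value."""
    return max((value_type(v) for v in values), key=ORDER.index)

print(most_specific_common_type([1, 2, 3]))        # INT
print(most_specific_common_type([1, 2.5]))         # REAL
print(most_specific_common_type([1, 2.5, "n/a"]))  # STRING
```

Once the widest needed type is known, the warehouse column is altered to that type, so contradictory values (say, integers followed by free text) still fit without data loss.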
0:22:17
Another capability that you provide is Change Data Capture for when you're loading from these relational database systems into the data warehouse. And that's a problem space that I've always been interested in as far as how you're able to capture the change logs within the data system, and then be able to replay them effectively to reconstruct the current state of the database without just doing a straight SQL dump. And I'm wondering how you handle that in your platform?
0:22:46
Yeah, it's very complicated. Most people who build in-house data pipelines, as you say, just do a dump and load of the entire table, because the change logs are so complicated. And the problem with dump and load is that it requires huge bandwidth, which isn't always available, and it takes a long time. So you end up running it just once an hour if you're lucky, but for a lot of people, once a day. So we do change data capture: we read the change logs of each database. Each database has a different change log format, and most of them are extremely complicated. If you look at the MySQL change log format, or the Oracle change log format, it is like going back in time through the history of MySQL; you can sort of see every architectural change in MySQL in the change log format. The answer to how we do that is that there's no trick. It's just a lot of work understanding all the possible corner cases of these change logs. It helps that we have many customers with each database. Unlike when you're building a system just for yourself, because we're building a product, we have lots of MySQL users and lots of Postgres users. And so over time, we see all the little corner cases, and you eventually figure it out. You eventually find all the things, and you get a system that just works. But the short answer is there's really no trick. It's just a huge amount of effort by the databases team at Fivetran, who at this point has been working on it for years with, you know, hundreds of customers. So at this point, we've invested so much effort in tracking down all those little things that there's just no hope that you could do better yourself, building a change log reader just for your own company.
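The replay half of change data capture, applying a stream of log entries to an existing replica instead of re-dumping the whole table, can be illustrated with an in-memory toy. The log format here is invented for illustration; real formats like the MySQL binlog or Postgres WAL are vastly more involved:

```python
def replay(change_log):
    """Rebuild a table replica by applying change-log entries in order."""
    replica = {}
    for entry in change_log:
        op, pk = entry["op"], entry["pk"]
        if op in ("insert", "update"):
            replica[pk] = entry["row"]      # last write for a key wins
        elif op == "delete":
            replica.pop(pk, None)
        else:
            raise ValueError(f"unknown op: {op!r}")  # fail fast on surprises
    return replica

log = [
    {"op": "insert", "pk": 1, "row": {"name": "Acme"}},
    {"op": "update", "pk": 1, "row": {"name": "Acme Corp"}},
    {"op": "insert", "pk": 2, "row": {"name": "Globex"}},
    {"op": "delete", "pk": 2},
]
print(replay(log))  # {1: {'name': 'Acme Corp'}}
```

Because only the entries since the last sync need to be shipped and applied, this runs on a tight cycle with a fraction of the bandwidth a full dump and load would need.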
0:24:28
For the particular problem space you're in, you have a sort of many-to-many issue, where you're dealing with a lot of different types of data sources, and then you're loading into a number of different options for data warehouses. And on the source side, I'm wondering what you have found to be some of the most complex or challenging sources to work with reliably, and some of the strategies that you have found to be effective for picking up a new source and getting it production ready in the shortest amount of time.
0:24:57
Yeah, it's funny, you know, if you ask any engineer at Fivetran, they can all tell you what the most difficult data sources are, because we've had to do so much work on them over the years. Undoubtedly, the most difficult data source is Marketo. Close seconds are Jira and Asana, and then probably NetSuite. Those APIs just have a ton of incidental complexity, and it's really hard to get data out of them fast. We're working with some of these sources to try to help them improve their APIs to make it easier to do replication, but there's a handful of data sources that have required disproportionate work to get them working reliably. In general, one funny observation that we have seen over the years is that the companies with the best APIs tend, unfortunately, to be the least successful companies. It seems to be a general principle that companies which have really beautiful, well-organized APIs tend not to be very successful businesses, I guess because they're just not focused enough on sales or something. We've seen it time and again, where we integrate a new data source, and we look at the API and we go, man, this API is great, I wish you had more customers so that we could sync for them. The one exception, I would say, is Stripe, which has a great API and is a highly successful company, and that's probably because their API is their product. So there's definitely a spectrum of difficulty. In general, the oldest, largest companies have the most complex APIs.
0:26:32
I wonder if there's some reverse incentive where they make their APIs obtuse and difficult to work with, so that they can build up an ecosystem of contractors around them whose sole purpose is to integrate them with other systems.
0:26:46
You know, I think there's a little bit of that, but less than you would think. For example, the company that has by far the most extensive ecosystem of contractors helping people integrate their tool with other systems is Salesforce. And Salesforce's API is quite good; Salesforce is actually one of the simpler APIs out there. It was harder a few years ago when we first implemented it, but they made a lot of improvements, and it's actually one of the better APIs now.
0:27:15
Yeah, I think that's probably coming off the tail of their acquisition of MuleSoft to sort of reformat their internal systems and data representation to make it easier to integrate. Because I know beforehand, it was just a whole mess of XML.
0:27:27
You know, it was really before the MuleSoft acquisition that a lot of the improvements in the Salesforce API happened. The Salesforce REST API was pretty well structured and rational five years ago; it would fail a lot, you would send queries and they would just not return when you had really big data sets, and now it's more performant. So I think it predates the MuleSoft acquisition; they just did the hard work to make all the corner cases work reliably and scale to large data sets, and Salesforce is now one of the easier data sources. Actually, I think there are certain objects that have complicated rules, and I think the developers at Fivetran who work on Salesforce will get mad at me when they hear me say this, but compared to, like, NetSuite, it's pretty great.
0:28:12
On the other side of the equation, where you're loading data into the different target data warehouses, I'm wondering what your strategy is as far as making the most effective use of the feature sets that are present, or do you just target the lowest common denominator of SQL representation for loading data in, and then leave the complicated aspects of it to the end user for doing the transformations and analyses?
0:28:36
So most of the code for doing the load side is shared between the data warehouses. The differences are not that great between different destinations, except BigQuery; BigQuery is a little bit of an unusual creature. So if you look at Fivetran's code base, there's actually a different implementation for BigQuery that shares very little with all of the other destinations. So the differences between destinations are not that big of a problem for us. There are certain things that do differ, you know, there are functions that have to be overridden for different destinations, for things like the names of types, and there are some special cases around performance where our load strategies are slightly different, for example, between Snowflake and Redshift, just to get faster performance. But in general, that actually is the easier side of the business, the destinations. And then in terms of transformations, it's really up to the user to write the SQL that transforms their data. And it is true that to write effective transformations, especially incremental transformations, you always have to use the proprietary features of the particular database that you're working on.
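To make the incremental-transformation idea concrete, here is a minimal sketch using SQLite as a stand-in warehouse. The table names and the watermark mechanism are illustrative, not Fivetran's implementation; a real warehouse would lean on proprietary features such as Snowflake's or BigQuery's MERGE instead of SQLite's portable upsert.

```python
import sqlite3

def incremental_rollup(conn: sqlite3.Connection) -> None:
    """Roll new payment rows up into per-account totals, then advance the mark."""
    (mark,) = conn.execute("SELECT mark FROM watermark").fetchone()
    # Only aggregate rows that arrived since the last run.
    new_rows = conn.execute(
        "SELECT account, SUM(amount) FROM payments "
        "WHERE updated_at > ? GROUP BY account",
        (mark,),
    ).fetchall()
    for account, delta in new_rows:
        # Portable upsert; a real warehouse would use its own MERGE syntax.
        conn.execute(
            "INSERT INTO totals(account, total) VALUES (?, ?) "
            "ON CONFLICT(account) DO UPDATE SET total = total + excluded.total",
            (account, delta),
        )
    (new_mark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), ?) FROM payments", (mark,)
    ).fetchone()
    conn.execute("UPDATE watermark SET mark = ?", (new_mark,))
```

Each run touches only the rows added since the previous watermark, which is what makes the transformation cheap enough to run continuously, and also why it depends on database-specific upsert behavior.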
0:29:46
On the incremental piece, I'm interested in how you address that for some of the different source systems, because for the databases where you're doing Change Data Capture, it's fairly obvious that you can take that approach for data loading. But for some of the more API-oriented systems, I'm wondering if there's a high degree of variability in being able to pull in just the objects that have changed since the last sync time, or if there are a number of systems that will just give you absolutely everything every time, and then you have to do the diffing on your side.
0:30:20
The complexity of those data sources, I know I mentioned this earlier, but it is staggering. But yes, on the API side, we're also doing Change Data Capture of apps. It is different for every app, but just about every API we work with provides some kind of change feed mechanism. Now, it is complicated; you often end up in a situation where the API will give you a change feed that's incremental, but then other endpoints are not incremental. So you have to do this thing where you read the change feed, you look at the individual events in the change feed, and then you go look up the related information from the other entity. So you end up dealing with a bunch of extra complexity because of that. But as with all things at Fivetran, we have this advantage that we have many customers with each data source. So we can put in that disproportionate effort, which you would never do if you were building it just for yourself, to make the change capture mechanism work properly, because we just have to do it once, and then everyone who uses that data source can benefit from it.
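The pattern described here, an incremental change feed whose events must be enriched from a separate, non-incremental endpoint, can be sketched roughly as follows. The `fetch_*` callables and field names are hypothetical stand-ins for real API calls, not any vendor's actual endpoints.

```python
# Sketch of app-side Change Data Capture: consume an incremental change feed,
# and for each event do the extra lookup against a non-incremental endpoint
# to pull in the related record before writing to the destination.

def sync_changes(cursor, fetch_events, fetch_contact, store):
    """Consume the change feed from `cursor` onward, join in the related
    contact record for each event, and return the new cursor position."""
    for event in fetch_events(since=cursor):
        contact = fetch_contact(event["contact_id"])  # the extra per-event lookup
        store[event["contact_id"]] = {**contact, "last_event": event["type"]}
        cursor = max(cursor, event["seq"])
    return cursor
```

Persisting the returned cursor between runs is what makes the sync incremental; the per-event lookup is where the "bunch of extra complexity" (rate limits, missing records, retries) tends to live in practice.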
0:31:23
For people who are getting onboarded onto the Fivetran system, I'm curious what the overall workflow looks like as far as the initial setup, and then what their workflow looks like as they're adding new sources or just interacting with their Fivetran account to keep track of the overall health of their system, or if it's largely just fire and forget, and they're only interacting with the data warehouse at the other side.
0:31:47
It's pretty simple. The joke at Fivetran is that our demo takes about 15 seconds. Because we're so committed to automation, and we're so committed to this idea that Fivetran's fundamental job is to replicate everything into your data warehouse, and then you can do whatever you want with it, it means that there's very little UI. The process of setting up a new data source is basically connect source, which for many sources is as simple as just going through an OAuth redirect, where you just click, you know, yes, Fivetran is allowed to access my data, and that's it. And connect destination, which, now that we're integrated with Snowflake and BigQuery, means you can just push a button in Snowflake or in BigQuery and create a Fivetran account that's pre-connected to your data warehouse. So the setup process is really simple. Then, after setup, there's a bunch of UI around monitoring what's happening. We like to say that Fivetran is a glass box; it was originally a black box, and now it's a glass box. You can see exactly what it's doing. You can't change it, but you can see exactly what we're doing at all times. And, you know, part of that is in the UI, and part of that is in the emails you get when things go wrong or the sync finishes for the first time, that kind of thing.
0:33:00
As part of that visibility, I also noticed that you will ship the transaction logs to the end user's log aggregation system. And I thought that was an interesting approach as far as giving them a way to access all of that information in one place, without having to go to your platform just for that one-off case of trying to see what the transaction logs are and gain that extra piece of visibility. So I'm wondering what types of feedback you've gotten from users as far as the overall visibility into your systems and the ways that they're able to integrate it into their monitoring platforms?
0:33:34
Yeah, so the logs we're talking about are the logs of every action Fivetran took, like Fivetran made this API call against Salesforce, Fivetran ran this LogMiner query against Oracle. So we record all this metadata about everything we're doing, and you can see that in the UI, but you can also ship it to your own logging system like CloudWatch or Stackdriver, because in the same way that a lot of companies have a centralized data warehouse, they have a centralized logging system. It's mostly used by larger companies; those are the ones who invest the effort in setting up those centralized logging systems. It's actually the system we built first, before we built it into our own UI, and later we found it's also important just to have it in our own UI, as a quick way to view what's going on. And, yeah, I think people have appreciated that we're happy to support the systems they already have, rather than try to build our own thing and force you to use that.
0:34:34
I imagine that that also plays into efforts within these organizations to track data lineage and provenance, for understanding the overall lifecycle of their data as it spans across different systems.
0:34:47
You know, that's not so much of a logging problem; that's more of a metadata problem inside the data warehouse. When you're trying to track lineage, to say, like, this row in my data warehouse came from this transformation, which came from these three tables, and these tables came from Salesforce, and it was connected by this user, and it synced at this time, etc., that lineage problem is really more of a metadata problem. And that's kind of a greenfield in our area right now. There are a couple of different companies that are trying to solve that problem. We're doing some interesting work on that in conjunction with our transformations. I think it's a very important problem, and there's still a lot of work to be done there.
0:35:28
So on the sales side of things too, I know you said that your demo is about 15 seconds, as far as, yes, you just do this and this, and then your data is in your data warehouse. But I'm wondering what you have found to be some of the common questions or common issues that bring people to you as far as evaluating your platform for their use cases, and just some of the overall user experience design that you've put into the platform as well to help ease that onboarding process.
0:35:58
Yeah, so a lot of the discussions in the sales process really revolve around that ELT philosophy: Fivetran is going to take care of replicating all of your data, and then you're going to curate it non-destructively using SQL, which for some people just seems like the obvious way to do it. But for others, this is a very shocking proposition, this idea that your data warehouse is going to have this comparatively uncurated schema that Fivetran is delivering data into, and then you're basically going to make a second copy of everything. For a lot of people who've been doing this for a long time, that's a very surprising approach. And so a lot of the discussion in sales revolves around the trade-offs of that, and why we think that's the right answer for the data warehouses that exist today, which are just so much faster and so much cheaper that it makes sense to adopt that more human-friendly workflow than maybe it would have in the 90s.
0:36:52
And what are the cases where Fivetran is the wrong choice for replicating data or integrating it into a data warehouse?
0:37:00
Well, if you already have a working system, you should keep using it. We don't advise people to change things just for the sake of change. If you've set up, you know, a bunch of Python scripts that are syncing all your data sources, and it's working, keep using it. What usually happens that causes people to throw out a system is schema changes, death by a thousand schema changes. They find that the data sources upstream are changing, the scripts that are syncing their data are constantly breaking, and it's this huge effort to keep them alive. That's the situation where prospects will abandon an existing system and adopt Fivetran. But what I'll tell people is, you know, if your schema is not changing, if you're not having to go fix these pipelines every week, don't change it, just keep using it.
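The "death by a thousand schema changes" failure mode comes down to drift handling: before each load, a hand-rolled pipeline has to notice new upstream columns and alter the destination accordingly, or the sync breaks. A minimal sketch of that reconciliation step, using SQLite as a stand-in destination with illustrative table and column names:

```python
import sqlite3

# Sketch of schema-drift reconciliation: compare the source's current columns
# against the destination table and add whatever is missing before loading.

def reconcile_schema(conn, table, source_columns):
    """Add any source columns that the destination table lacks; return them."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sqltype in source_columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sqltype}")
            added.append(name)
    return added
```

Even this toy version hints at the real difficulty: renamed columns, narrowed types, and dropped fields have no safe automatic answer, which is why hand-maintained scripts keep breaking.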
0:37:49
And as far as the overall challenges or complexities of the problem space that you're working with, I'm wondering what you have found to be some of the most difficult to overcome, or some of the ones that are most noteworthy and that you'd like to call out for anybody else who is either working in this space or considering building their own pipeline from scratch.
0:38:11
Yeah, you know, I think that when we got our first customer in 2015, syncing Salesforce to Redshift, and two weeks later we got our second customer, syncing Salesforce and HubSpot and Stripe into Redshift, I sort of imagined that this sync problem was something we were going to have solved in a year, and then we would go on and build a bunch of other related tools. And the sync problem is much harder than it looks at first. Getting all the little details right, so that it just works, is an astonishingly difficult problem. It is a parallelizable problem; you can have lots of developers working on different data sources, figuring out all those little details, and we have accumulated general lessons that we've incorporated into our core code. So we've gotten better at doing this over the years. And it really works when you have multiple customers who have each data source, so it works a lot better as a product company than as someone building an in-house data pipeline. But the level of complexity associated with just doing replication correctly was kind of astonishing to me, and I think it is astonishing for a lot of people who try to solve this problem. You know, you look at the API docs of a data source, and you figure, oh, I think I know how I'm going to sync this. And then you go into production with 10 customers, and suddenly you find 10 different corner cases that you never thought of that are going to make it harder than you expected to sync the data. So the level of difficulty of just that problem is kind of astonishing, but the value of solving just that problem is also kind of astonishing.
0:39:45
On both the technical and business side, I'm also interested in understanding some of the most interesting or unexpected or useful lessons that you've learned in the overall process of building and growing Fivetran?
0:39:59
Well, I've talked about some of the technical lessons, in terms of, you know, just solving that problem really well being both really hard and really valuable. In terms of the business lessons we've learned, growing the company is a co-equal problem to growing the technology. I've been really pleased with how we've made a place where people seem to genuinely like to work, where a lot of people have been able to develop their careers in different ways. Different people have different career goals, and you need to realize that as someone leading a company; not everyone at the company is like myself, and they have different goals that they want to accomplish. So that problem of growing the company is just as important, and just as complex, as solving the technical problems and growing the product and growing the sales side and helping people find out that you have this great product that they should probably be using. So I think that has been a real lesson for me over the last seven years that we've been doing this. Now, for the future of Fivetran, what do you have planned, both on the business roadmap as well as the feature sets that you're looking to integrate into Fivetran, and just some of the overall goals that you have for the business as you look forward?
0:41:11
Sure. So
0:41:12
Some of the most important stuff we're doing right now is on the sales and marketing side. We have done all of this work to solve this replication problem, which is very fundamental and very reusable, and I like to say no one else should have to deal with all of these APIs. Since we have done it, you should not need to write a bunch of Python scripts to sync your data, or configure Informatica, or anything like that. We've done it once so that you don't have to, and I guarantee you it will cost you less to buy Fivetran than to have your own team building an in-house data pipeline. So we're doing a lot of work on the sales and marketing side just to get the word out that Fivetran is out there, and that it might be something that's really useful to you. On the product side, we are doing a lot of work now in helping people manage those transformations in the data warehouse. So we have the first version of our transformations tool in our product, and there's going to be a lot more sophistication added to it over the next year. We really view that as the next frontier for Fivetran: helping people manage the data after we've replicated it.
0:42:17
Are there any other aspects of the Fivetran company and technical stack, or the overall problem space of data synchronization, that we didn't touch on that you'd like to cover before we close out the show?
0:42:28
I don't think so. I think the thing that people tend not to realize, because they tend to just not talk about it as much, is that the real difficulty in this space is all of that incidental complexity of all the data sources. You know, Kafka is not going to solve this problem for you. Spark is not going to solve this problem for you. There is no fancy technical solution. Most of the difficulty of the data centralization problem is just in understanding and working around all of the incidental complexity of all these data sources.
0:42:58
For anybody who wants to get in touch with you or follow along with the work that you and Fivetran are doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
0:43:15
Yeah, I think that the biggest gap right now is in the tools that are available to analysts who are trying to curate the data after it arrives. Writing all the SQL that curates the data into a format that's ready for the business users to attack with BI tools is a huge amount of work, and it remains a huge amount of work. If you look at the workflow of the typical analyst, they're writing a ton of SQL, and it's a very analogous problem to a developer writing code using Java or C#, but the tools that analysts have to work with look like the tools developers had in, like, the 80s. I mean, they don't even really have autocomplete. So I think that is a really under-invested-in problem: the tooling for analysts, to make them more productive in the exact same way as we've been building tooling for developers over the last 30 years. A lot of that needs to happen for analysts too, and I think it hasn't happened yet.
0:44:13
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Fivetran and some of the insights that you've gained in the process. It's definitely an interesting platform and an interesting problem space, and I can see that you're providing a lot of value. So I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day.
0:44:31
Thanks for having me on.