Data Infrastructure Automation For Private SaaS At Snowplow - Episode 120

Summary

One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure then it’s worth listening to how Josh and his team are approaching the problem.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at dataengineeringpodcast.com/linode or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customers’ cloud accounts.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of the components in your system architecture and the nature of your managed service?
  • What are some of the challenges that are inherent to the private SaaS nature of your managed service?
  • What elements of your system require the most attention and maintenance to keep them running properly?
  • Which components in the pipeline are most subject to variability in traffic or resource pressure and what do you do to ensure proper capacity?
  • How do you manage deployment of the full Snowplow pipeline for your customers?
    • How has your strategy for deployment evolved since you first began offering the managed service?
    • How has the architecture of the pipeline evolved to simplify operations?
  • How much customization do you allow for in the event that the customer has their own system that they want to use in place of one of your supported components?
    • What are some of the common difficulties that you encounter when working with customers who need customized components, topologies, or event flows?
      • How does that reflect in the tooling that you use to manage their deployments?
  • What types of metrics do you track and what do you use for monitoring and alerting to ensure that your customers’ pipelines are running smoothly?
  • What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with and on Snowplow?
  • What are some lessons that you can generalize for management of data infrastructure more broadly?
  • If you could start over with all of Snowplow and the infrastructure automation for it today, what would you do differently?
  • What do you have planned for the future of the Snowplow product and infrastructure management?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost you might expect. You deserve ClickHouse, the open source analytical database that deploys and scales wherever and whenever you want it to, and turns data into actionable insights. And Altinity, the leading software and services provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity, that's A-L-T-I-N-I-T-Y, for a free consultation to find out how they can help you today. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in New York, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customers' cloud accounts. So Josh, can you start by introducing yourself?
Josh Beemster
0:02:33
Sure, pleasure to be here. My name is Josh and I'm the technical operations lead at Snowplow. I've been heading up that role for the last four years, and working with Snowplow for the last five years. I'm responsible for all the cloud infrastructure and maintenance across our currently 150-plus clients.
Tobias Macey
0:02:53
And do you remember how you first got involved in the area of data management?
Josh Beemster
0:02:58
It was a little bit by accident, honestly. I started at Snowplow as an engineer, and expressed quite an interest in automation and getting rid of repetitive tasks. That naturally moved into infrastructure management, and it just went from there.
Tobias Macey
0:03:13
And so can you start by giving a bit of an overview of the components in the system architecture of Snowplow, and the nature of how you deploy and maintain the managed service that you offer?
Josh Beemster
0:03:26
Sure. I think it's probably best to start with the nature of the managed service before jumping into the system architecture. What we offer is essentially what we've coined as private SaaS. What we mean by that is that it's a fully managed service, but it's isolated in a client's own sub-account. Essentially, each client comes to us and gives us their own sub-account or their own Google Cloud project, and we set up and maintain a full data pipeline stack within that sub-account, so every client has their own isolated infrastructure, entirely segmented from every other client. It's SaaS in that we manage everything end to end and we're responsible for all of the running of it, but it's very much not SaaS in that there's no shared tenancy across anything. That's obviously quite a difficult thing to manage. In terms of what that means for the system architecture on our side, we have a lot of tooling that we've built to manage that, which leverages the HashiCorp stack heavily. We're using a lot of Terraform to define all of our infrastructure as code, we've got HashiCorp Consul to manage all of our metadata and Vault to manage secrets, and then Nomad to do the widespread distribution of deploying all of that infrastructure. So I guess it's different in that sense. There are some parallels with companies that would offer you a license to their software that you would go away and deploy yourself. We take that a step further, where you're not only buying the license, you're buying the whole stack experience.
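To make the Consul-for-metadata and Vault-for-secrets split concrete, here is a minimal sketch of how a deployment tool might read per-client configuration from Consul's KV store and credentials from Vault before rendering anything. The endpoints, key paths, and client libraries (python-consul, hvac) are illustrative assumptions, not Snowplow's actual tooling.

```python
# Hedged sketch: pull per-client pipeline metadata from Consul and credentials
# from Vault. Key paths and host names are hypothetical examples.
import json

import consul  # pip install python-consul
import hvac    # pip install hvac

CONSUL_HOST = "consul.internal.example.com"        # assumed internal endpoint
VAULT_ADDR = "https://vault.internal.example.com"  # assumed internal endpoint


def load_client_config(client_id: str) -> dict:
    """Read the pipeline metadata stored for one client in Consul's KV store."""
    c = consul.Consul(host=CONSUL_HOST)
    _index, entry = c.kv.get(f"pipelines/{client_id}/config")  # hypothetical key
    if entry is None:
        raise KeyError(f"no config found for {client_id}")
    return json.loads(entry["Value"].decode("utf-8"))


def load_client_secrets(client_id: str, vault_token: str) -> dict:
    """Read that client's cloud credentials from Vault's KV v2 secrets engine."""
    v = hvac.Client(url=VAULT_ADDR, token=vault_token)
    resp = v.secrets.kv.v2.read_secret_version(path=f"pipelines/{client_id}")
    return resp["data"]["data"]


if __name__ == "__main__":
    config = load_client_config("acme-prod")
    secrets = load_client_secrets("acme-prod", vault_token="s.example")
    print(config["cloud"], list(secrets))
```

The appeal of the pattern is that the deployment code itself stays generic: everything client-specific lives in centralized metadata and secret stores rather than in the templates.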
Tobias Macey
0:05:05
And for people who want to dig deeper into what Snowplow itself is, I'll add a link to the interview that I did with your co-founder, Alex. But at a high level, it's an event data management platform for being able to replace things like Google Analytics or Segment. And so in terms of the private SaaS nature of your product, what are some of the challenges that are inherent to that deployment model that you're trying to overcome with some of the tooling and automation that you've built out?
Josh Beemster
0:05:32
The obvious one is that rather than having one big data pipeline to manage, we have 150 of these, and rather than being in one or two or three regions across the world, we're in pretty much every region of the world. So there are difficulties first off in just the sheer number of servers that we are responsible for, numbering in the tens of thousands, with quite a small SRE team. That is a big challenge in and of itself. With threats like, at the start of last year, Meltdown and Spectre and all of these scary security concerns, we were suddenly going, well, we're responsible for these servers, we have to go and manage that. So that's a big challenge for us in terms of staying on top of all of these systems and making sure that they're always up to date.

The other side of it is just how much automation we need to build into it. There can't be anything that really requires manual interaction or manual intervention. From top to bottom, everything must be self-healing, everything must be able to automatically recover, because otherwise we just can't scale our operation at all. Specifically on managing clients in this context, there's obviously a sharp change in dynamic, where in normal SaaS you wouldn't get to see the underlying infrastructure, you wouldn't get to see how things have been set up, and you wouldn't be able to go and poke around at any of these things. What we've had with quite a few clients is that they delve a little bit too deep, or they go and change things that maybe they shouldn't have, or they go and break things that maybe they shouldn't have. Those are quite difficult for us to manage, because obviously we've gone and deployed something in a way that we expect it to work, and then someone else has come in and turned something off or broken something or changed our access, whatever the case might be. Trying to balance that is quite difficult, so we've got a lot of drift-detection style systems constantly checking that nothing has changed and everything is staying exactly as it should be. So there are a lot of those challenges with, I guess, what you'd call the shared responsibility model between us and the client in terms of managing that sub-account, which we do struggle with sometimes.

The other interesting thing with private SaaS is just how auditable and how exposed it is. As we work with more security-conscious clients, we do end up getting audited a lot. We have a lot of very low-level conversations about how things have been set up and how things need to change to suit their particular business needs. Where normally you'd go and buy a service and not be too worried about exactly how they've instrumented it, suddenly, when we're deploying inside the client's ecosystem, we have to fit their checklist, we have to fit all of their security concerns, and we have to pass muster with their security teams and their SRE teams to make sure that everything is exactly how it needs to be for them. So we've got a lot of challenges there, not only in managing and orchestrating and running the whole thing, but also just getting signed off by a lot of these teams: is this up to spec?
So we have a lot of extras added into the platform as we have these conversations, which we have to adjust and manage in such a way that it is still scalable and still going to work for everyone, but we have to add lots of extra things on the fly.
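Josh mentions drift-detection style systems that constantly check whether anything has been changed out from under them in a client's sub-account. As a rough illustration of the idea (not Snowplow's actual tooling), one common way to detect drift in Terraform-managed infrastructure is to run `terraform plan -detailed-exitcode` on a schedule and alert whenever a stack reports pending changes; the stack paths and the notification hook below are assumptions.

```python
# Hedged sketch: detect drift by running `terraform plan` with -detailed-exitcode
# (0 = no changes, 1 = error, 2 = pending changes / drift) for each stack.
# Directory layout and notify() are hypothetical.
import subprocess


def check_drift(stack_dir: str) -> str:
    """Return 'clean', 'drift', or 'error' for one Terraform stack directory."""
    result = subprocess.run(
        ["terraform", f"-chdir={stack_dir}", "plan",
         "-detailed-exitcode", "-input=false", "-no-color"],
        capture_output=True, text=True,
    )
    return {0: "clean", 2: "drift"}.get(result.returncode, "error")


def notify(message: str) -> None:
    # Placeholder: in practice this might post to Slack, PagerDuty, or SNS.
    print(message)


if __name__ == "__main__":
    for stack in ["stacks/acme-prod/collector", "stacks/acme-prod/enrich"]:
        status = check_drift(stack)
        if status != "clean":
            notify(f"{stack}: {status} detected; manual changes may have been made")
```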
Tobias Macey
0:08:54
And because of the fact that you are running in the customer's account, I'm sure that there's also some measure of cost consciousness in terms of the bill for running all of these different resources, and handling scaling and trying to minimize the amount of resources that are necessary to keep this running. In a SaaS, the provider eats all of those costs and just passes them on in the charge to the end user, but because the end user in this case is running all of this in their own infrastructure, they're much more cognizant of the actual overall cost of running all of these pieces of infrastructure. So I'm wondering how you handle minimizing the resources necessary while still allowing for robustness and scalability in the platform that you're deploying.
Josh Beemster
0:09:41
That's a really pertinent question, and a great one to raise. It's a bit of a balancing act. We do have hard rules that we don't want to breach when it comes to deploying a production environment, which we do get sign-off on from the client. For example, a production environment has to be highly available, so you have to have a minimum of two availability zones, and you need to be setting up enough servers that if a catastrophic data center failure happens, you can handle it. So there are carve-outs to say there's a minimum spec for what this looks like. But beyond that, the architecture is flexible enough that we can save costs in a lot of ways. We work quite closely with clients on how we do instance reservations, how we right-size pipelines, and how we right-size them for their particular traffic patterns. So there is a level of customization and work that we have to go through to make that happen.

On the whole, though, we've come to a pretty good sense of what the minimum things are that we need, and that's where we start. As clients ramp up traffic, we then tend to have the harder discussions around, okay, to make sure that this is stable at these volumes we're going to have to massively upscale Kinesis, for example, to make sure that there are no back-pressure issues, or you've got a one-second latency requirement, which is going to require these sorts of changes. Every client is different in that sense. Some of them won't mind there being a bit of latency build-up in their pipeline, and then we can size it down for cost; some are very latency-conscious but far less cost-conscious. So it also comes down to what the value of the data is for the company. If it's just a report that runs once a week, we can definitely optimize for cost. If it's something that needs to run every second, then we need to optimize for performance. It's a very conversation-heavy topic with clients, we find.

There are obviously blanket strategies: instance reservations are a classic one, and turning off certain parts of the service as well. It being quite a modular data pipeline, you can plug in different exports. For example, in the real-time pipeline that we deploy on AWS, you can send data into Elasticsearch, you can send data into S3, you can send data into Snowflake DB, you can send data into Indicative, you can send data into Redshift, but you don't have to pick all of these targets. So we also work with clients to figure out what the best way is for them to consume this information, and then set up their pipeline accordingly so they're only paying for what they really need. But you're exactly right that because it is running in their sub-account, they do bear that cost. The flip side to that particular point, though, is an interesting one, which is that none of our clients really have to worry about volume-based pricing in the same way as you would with a competitor like Segment, for example, or any of the other SaaS analytics providers with volume-based pricing. With Snowplow, as you scale up and as you track more, your costs are not going to drastically increase; they will increase linearly with infrastructure costs.
But there isn't the kind of exponential cost growth that generally comes with volume-based pricing.
Tobias Macey
0:13:11
And then another thing that's interesting about your model is that shared responsibility you mentioned, because the servers are running in the client's account and they have their own way of managing infrastructure. I'm sure that there are some instances where you have conflicts as far as how they would prefer to handle deployments, where they have their own infrastructure automation and configuration management. And then on the monitoring side, I know that you keep track of the health and well-being of the overall system, but I'm sure that the customer is also interested in being able to consume those metrics into their own systems to get visibility. So I'm wondering how you handle that aspect of the responsibility being on your end to keep everything running, but at the same time the customer wanting to have greater visibility and tight integration with the systems that they already have deployed.
Josh Beemster
0:14:01
Yes, that is a really common theme. On the monitoring side, we do tend to leverage Amazon CloudWatch and, on GCP, Stackdriver quite heavily to alleviate that to some extent. Rather than figuring out how to pull these metrics in and present them back to the client in a nice portable way, we leverage the cloud's own tooling there and then provide easier ways for them to hook into, for example, SNS topics for getting all of the same alerts that our own ops teams get. So they can look at that, but also, by virtue of the fact that it's in their sub-account, they can explore any metric that's exported to those systems, and that's where all of our metrics live. There's very little bespoke monitoring, so to speak, that comes only to us and that only we have visibility over. So on the monitoring side that's mostly handled now, which does make life easier.

On the customization side, where we're not fitting the perfect mold, it is often fairly difficult. There are times that we do need to develop more custom solutions, which we'll do on an ad hoc basis. The difficulty for us in offering any sort of bespoke solution, though, is that we are running it across 150 different stacks, so the name of the game for us has to be consistency across the client base. What ends up happening is that as clients have more security requirements, those become part of the standards that we make available for everyone. Any feature that we develop for one is developed for all, and in that way, as we go, we end up being able to tick more of those security boxes without necessarily having to do something bespoke every single time. In a lot of cases we do manage to convince them that making too many changes is not always necessary, and that we can have that shared responsibility where we run things how we need to run them without needing to change everything. We're yet to come up against someone that really won't let us work how we want to work.
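As a rough sketch of the pattern Josh describes, where alerts flow through the cloud's native tooling so that both the operator's ops team and the customer can subscribe to the same SNS topic, the snippet below creates a CloudWatch alarm on a Kinesis consumer-lag metric and points it at an SNS topic. The stream name, topic ARN, and thresholds are illustrative assumptions, not Snowplow's actual configuration.

```python
# Hedged sketch: alarm on growing Kinesis iterator age (a sign the pipeline is
# falling behind) and notify an SNS topic that both the operator and the
# customer can subscribe to. Names, ARNs, and thresholds are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="acme-prod-enrich-iterator-age",
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "acme-prod-enriched-stream"}],
    Statistic="Maximum",
    Period=60,                 # evaluate one-minute windows
    EvaluationPeriods=5,       # five consecutive breaches before alerting
    Threshold=600_000,         # ten minutes of lag, in milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:acme-prod-pipeline-alerts"],
    AlarmDescription="Enriched stream consumers are falling behind",
)
```

Because the alarm and topic live inside the customer's own sub-account, the customer can attach their own email, chat, or PagerDuty subscriptions to the same topic without any bespoke integration work.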
Tobias Macey
0:16:13
And another element of customization is the fact that the overall Snowplow pipeline is very composable, and different components within the stack can be swapped for some equivalent system. For instance, the Kinesis that you would run in an AWS account might be replaced with Google Cloud Pub/Sub in a Google account, or somebody might already be running Kafka or Pulsar. So I'm wondering how you approach that aspect of customization and allowing the customers to specify how they want different elements of the system to be manifested, based on their preferences or what they already have running, to allow for better integration with the data systems that they might want to integrate with.
Josh Beemster
0:16:53
With the managed service that we offer at the moment, on GCP we only support Pub/Sub and on Amazon we only support Kinesis. So for that core part of the pipeline, where we're collecting and enriching and storing that data, there's not a whole lot of flexibility just yet, in terms of the fact that we have to own a standard part of the pipeline that is the same for everyone. Where we allow a lot of flexibility, though, is in what you can plug into the pipeline on top of that. For example, if you've got a big internal Kafka cluster, what we see a lot of clients end up doing is streaming all the data that we place into Kinesis into their own Kafka cluster to do larger fan-out operations, which Kinesis doesn't support as well as Kafka might.

The key issue there that's worth touching on is not that we don't want to support loading into someone else's data stream, it's that we support SLAs on the latency of data into those data streams, and if we have a third-party dependency that we can't control, that's very difficult to meet. For example, if a client did want us to load directly into their Kafka cluster, but we had no authority over said Kafka cluster and they had a massive spike in traffic, there's no way for us to really account for that. There's no way for us to go and say, hey, we need to increase the size of your Kafka cluster because the pipeline can't keep up. So there is a certain need for a separation of concerns there as well, so that we can ensure the health of the pipeline. Having too many external dependencies, or any external dependencies, makes that exponentially more difficult, not only in terms of making sure that the pipeline is working, but even just debugging why something is happening becomes harder, because you're not in control of the entire fabric, the entire system. So the core part is very locked down in terms of what we support, and we very much want to be in control of that. But as I said, forking off the pipeline into other systems is definitely something we see a lot of. Writing custom Lambda functions or Google Cloud Functions, or streaming that data with Kinesis-to-Kafka connectors, are quite common patterns to push that data into new and different places. And we obviously also have our own Analytics SDKs that can plug in on top of those streams and then do custom mutations before sending the data somewhere else as well.
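To illustrate the fan-out pattern Josh describes, here is a minimal sketch of a Lambda function that reads enriched events from a Kinesis stream and republishes them to an internal Kafka cluster. The topic name, broker addresses, and the use of the kafka-python client are illustrative assumptions, not a Snowplow-provided connector.

```python
# Hedged sketch: Kinesis-triggered Lambda that forwards enriched events into an
# internal Kafka cluster for downstream fan-out. Brokers, topic, and keying
# strategy are hypothetical.
import base64

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka-1.internal.example.com:9092"],  # assumed brokers
    acks="all",
)


def handler(event, context):
    """Lambda entry point for a Kinesis event source mapping."""
    for record in event["Records"]:
        # Kinesis delivers each payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        # Reuse the Kinesis partition key so per-key ordering is preserved.
        key = record["kinesis"]["partitionKey"].encode("utf-8")
        producer.send("snowplow-enriched-events", key=key, value=payload)
    producer.flush()
    return {"records_forwarded": len(event["Records"])}
```

Because the fork happens downstream of the managed stream, the core pipeline's latency SLA stays independent of the health of the customer's Kafka cluster, which is exactly the separation of concerns described above.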
Tobias Macey
0:19:31
And so in the overall system, which components are the ones that are most subject to variability in traffic or resource pressure, and what are some of the strategies that you use to ensure proper capacity when there's burstiness in the events being ingested, or to be able to meet some of those latency SLAs that you mentioned?
Josh Beemster
0:19:53
So obviously we leverage a lot of auto-scaling to account for that burstiness, but not everything is auto-scaled. The biggest issues that we come up against in dealing with that burstiness are sometimes to do with how fast we can get new EC2 nodes online, but generally it's with the few non-auto-scaling components within the pipeline. Take GCP, for example: we're using Pub/Sub there, and Pub/Sub is this beautifully elastic, auto-scaling system where you can throw as much as you want at it and it will scale to meet demand without any issues. Where we run into issues is on the flip side, in how we run AWS, which is using Kinesis, which has a fixed size. Kafka or Azure Event Hubs would have the same sorts of issues, where you've got much more fixed, sharding-based ingress limitations.

There are two ways we tend to tackle that. One is that we've built our own sort of proprietary auto-scaling tech that goes and does reshards of Kinesis as and when needed. But that tends to fall over quite quickly at higher shard counts: if you're getting into the hundred or two hundred plus shards, doing a resize can take anywhere up to 30 or 35 minutes, which is often far too slow for a very big burst in traffic. So in these cases we tend to work with clients, look at their traffic patterns, and look at where they're going to be evolving up to. We do quite a bit of trend analysis there, and we can say, well, if you want to keep the pipeline healthy, we're going to have to keep this much of a buffer in place for this non-auto-scaling component, otherwise we're going to run into issues that are not going to be recoverable very quickly. This is obviously not the best strategy: instead of having a nice auto-scaling, elastic architecture, you've suddenly got hard-coded capacity, which means that we have to have around-the-clock ops availability to check for alerts, check when you're starting to reach those thresholds, and go and scale that up. We're actively looking at alternatives at the moment for how we can swap out those systems for something a bit more Pub/Sub-esque, especially on Amazon. We're looking at, for example, how we could maybe swap out Kinesis for SQS and SNS for that similar sort of elastic, auto-scaling queuing with fan-out, rather than leveraging something like Kinesis. On the streaming side, that's probably the biggest bottleneck; the rest of it is quite easy to auto-scale and is generally quite fast.

The other area where we have issues is with downstream data stores that are by nature a lot more static in size. Snowflake DB in a lot of ways has solved that, and BigQuery obviously has solved it as well, where there's effectively unlimited storage capacity: you can just throw whatever you want into it, and it's backed by blob storage, so you have your data lake in that sense. Where we do run into some issues, which Redshift is starting to address with the new instance types they've released, is that Redshift and Elasticsearch still serve as a weak point in the architecture, because there is capped capacity.
And especially when you're looking at a streaming pipeline, and you want to stream data in as quickly as it's arriving, big spikes in traffic can overwhelm CPU resources and can suddenly overwhelm the amount of provisioned capacity that you have for these systems, which can cause service interruption and downtime. So we have, again, around-the-clock teams that are waiting for these alerts to go in and upscale systems as and when we breach those thresholds. I guess the strategies there are to look at patterns: how has my tracking been evolving over the last months, how has the pipeline handled spikes in the past, and then size it up with a healthy buffer to make sure that when these things happen again, you're covered. But there's a limit to what we can do, especially running so many of these systems, to try to catch all edge cases, which is why we still need that around-the-clock ops team to deal with it.
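Josh mentions proprietary tooling that reshards Kinesis streams as traffic grows. Here is a minimal sketch of that kind of resize, assuming boto3 and a simple rule of thumb for sizing; the stream name, headroom factor, and throughput math are illustrative assumptions, not Snowplow's actual scaling logic.

```python
# Hedged sketch: resize a Kinesis stream based on recent peak ingest volume.
# Each shard accepts roughly 1 MB/s (or 1,000 records/s) of writes; the 50%
# headroom and stream name are hypothetical choices.
import math
from datetime import datetime, timedelta

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


def peak_incoming_mb_per_sec(stream: str, hours: int = 24) -> float:
    """Peak one-minute ingest rate for the stream over the last `hours` hours."""
    now = datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName="IncomingBytes",
        Dimensions=[{"Name": "StreamName", "Value": stream}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=60,
        Statistics=["Sum"],
    )
    peak_bytes_per_min = max((p["Sum"] for p in stats["Datapoints"]), default=0.0)
    return peak_bytes_per_min / 60 / 1_000_000


def resize_if_needed(stream: str, headroom: float = 1.5) -> None:
    target = max(1, math.ceil(peak_incoming_mb_per_sec(stream) * headroom))
    current = kinesis.describe_stream_summary(StreamName=stream)[
        "StreamDescriptionSummary"]["OpenShardCount"]
    if target != current:
        # Note: UpdateShardCount only allows moving to between half and double
        # the current shard count per call, and large resizes are slow.
        kinesis.update_shard_count(
            StreamName=stream,
            TargetShardCount=target,
            ScalingType="UNIFORM_SCALING",
        )


resize_if_needed("acme-prod-raw-stream")
```

The slowness Josh describes shows up in exactly this API: a resize at a few hundred shards can take tens of minutes, which is why a pre-provisioned buffer is still needed for sudden bursts.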
Tobias Macey
0:24:18
And then another interesting point for me is the fact that you ended up going with Nomad as your substrate for being able to handle bin packing and managing the processes for all the different components that you're running, where a lot of the mindshare right now is with Kubernetes. And so I'm wondering if you can talk through the overall decision-making process that led you to that conclusion, and maybe talk a bit about some of the ways that your infrastructure management has evolved since you first began tackling this problem.
Josh Beemster
0:24:48
Just a quick point of clarification: for the client-side pipelines, we are actually leveraging Kubernetes in the GCP pipeline, and we're looking to leverage ECS in the Amazon pipeline, so we only use Nomad internally, as our internal orchestration and scheduling fabric. The reason we've chosen Nomad for that task is really just its deep level of integration with the rest of the HashiCorp stack. It seemed natural to say, well, we're using Terraform, we're using Consul, we're using Vault, we're using Packer, so we should use Nomad as well, because it has all of those nice, neat native integrations. I can talk about why we've chosen ECS as well, possibly after, but the larger point on where our evolution has happened is a very long history.

Where we started a few years ago, we had far fewer clients and a much simpler tech stack, and we could get by with a lot of manual work. The decisions we made at that point in time were very much that we could take a human-driven approach to deploying infrastructure: we'll put some of it in CloudFormation first, we'll have some checklists, we'll go through them, and we'll just get things running as quickly as possible. So it all started with Ansible playbooks that would run some CloudFormation that would spin up the pipelines. When we first started writing that automation, we made several big mistakes. Mistake number one was not making the infrastructure granular. Where we'd be deploying a VPC and an Elastic Beanstalk stack and maybe some Amazon S3 buckets and a Redshift cluster as well, we'd put all of that in one giant CloudFormation template. At the time, that made perfect sense, right? We had one version of what we were deploying, and we'd go and deploy it. Then we'd run into all these interesting issues where you'd have clients say, well, I don't want this part of it, I only want this part of it, and you go, okay, so now I've got a whole new version of my stack. So you fork that stack, and okay, you get this version, you get this version, and then you've got all those permutations. What we quickly realized is that what you need in infrastructure automation is not that big bang stack; you need very high granularity in all the components. In the same way that you mentioned Snowplow is very composable, built from lots of microservices, we needed to approach infrastructure in much the same way, or even more composably.

So where we are now in that journey is that we went from big bang Ansible playbooks with very large CloudFormation templates to a bespoke tooling system, which was still based on Ansible and CloudFormation but had a lot of that granularity starting to appear. That worked for quite a long time, but still wasn't very flexible. Part of that was probably our use of CloudFormation, which we found to be a little bit awkward to work with and awkward to make very flexible. But the key journey there was really about going from low granularity to high granularity. And then we ran into a further issue, which was about state. Up until we started using Terraform, all of our deployment tools had been completely stateless.
We'd leveraged the fact that we were using CloudFormation, so we could just query the outputs of CloudFormation templates, or we were writing API calls to go and check whether certain components had been deployed and what configuration they had at the time. So it was all just-in-time resolution, and that was really flexible and reduced any need for us to worry about state anywhere, but it also made us very lazy, in that we weren't caring about that, so we weren't necessarily making the right decisions. As well, every time we wanted to expand the system, we had to go and fetch all this information again. It also meant it was very hard to build a view of what had been deployed: it made it very difficult to write a tool that could just build a report and say, hey, this is everything that's deployed, this is the current state of the entire system, because it was all stateless. Doing that was very expensive, with long API calls and checks that were just not very useful. And that bespoke system, being what it was, was very hard to then turn into a nice API. It was also nearly impossible to train the rest of the team on it, which I quickly discovered as I started hiring more SREs and trying to train them on a tool that no one could actually use easily.

So that's when the next part of our journey began, which was saying, well, let's throw it all out and start again, which we actually started at the beginning of last year. We settled on Terraform because it was flexible enough to support multiple clouds, which we now do. We needed something that could have a common instruction language for GCP or AWS, and possibly in the future Azure or any other cloud that might appear. We also wanted something that a new engineer joining could deal with: they wouldn't have to learn a bespoke configuration language, they'd just have to learn what we built on top of it, which was a massive difference in terms of how well we could support this. It also had all of the heavy lifting done: it had state, and it had integrations with Consul and Vault, which we're leveraging quite heavily for deployments, centralized metadata management, and centralized secret management that we can then feed into all of the configuration that we've defined with Terraform. As well, as I mentioned previously, with our adoption of Nomad we've now been able to slap an API on the front of it all, which we call our deployment service, very aptly named, that can then go and use this whole ecosystem to manage all of this infrastructure. So we've come from a human-driven, choose-your-own-adventure style infrastructure management tool based on CloudFormation and Ansible to this world of API-driven, ChatOps-driven infrastructure management, which is where we've gotten to now. I guess the other quick thing I'd love to touch on there, which you mentioned, is that Kubernetes is kind of the flavor of the month for how everyone runs and manages containers.
On Google we did roll with Kubernetes. We did that because Google has a fully managed Kubernetes offering, which was very attractive to us, because then we didn't have to home-roll our own Kubernetes. That's a big thing we look for in our implementations: we need minimum overhead in every way, shape, and form. We can't have too much overhead, we can't do too many custom things; when we can use a cloud tool, we use the cloud tool, because that is very important for us in scaling out our operation. For GCP, Kubernetes seemed like a very good fit. For AWS, though, we're looking in a slightly different direction, and there are a few reasons for that. One is that Amazon's managed Kubernetes isn't quite the same breed as Google's managed Kubernetes: you're still responsible for setting up the underlying worker nodes. It's a bit like when Elastic Container Service first arrived and you still had to manage all of those EC2 servers yourself. It's not a fully managed service in that sense; you're still responsible for setting up your auto-scaling groups, setting up the servers, and hooking them up to the cluster. So there's that reason, but as well, what I found personally, and take this with a grain of salt, is that there is a lot going on in Kubernetes, and for what we're trying to do with it, it's not that it's overkill, but it does more than what we need it to do. By virtue of that fact, it costs more than maybe we want it to cost, in the sense that we don't need all of these advanced extra scheduling and management systems just to run a couple of pods. All they need to do is run a simple Docker container that scales up and down. There's no extra service discovery, there's no internal load balancing that needs to happen; that's all done for us already. So we're looking at ECS as just a very simple container management fabric, as opposed to Kubernetes, which, to its credit, is much more powerful, but it's just much more than what we tend to need for the Snowplow stack.
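Josh mentions putting an API, their deployment service, in front of the Terraform/Consul/Vault/Nomad ecosystem. As a loose illustration of how such a service might hand work off to Nomad (not Snowplow's actual deployment service), the sketch below registers a simple Docker-based job through Nomad's HTTP API; the Nomad address, ACL token handling, and job definition are assumptions.

```python
# Hedged sketch: register a job with a Nomad cluster over its HTTP API, the way
# an internal "deployment service" might queue up infrastructure work.
# Address, token, datacenter, and job shape are hypothetical.
import requests

NOMAD_ADDR = "https://nomad.internal.example.com:4646"  # assumed endpoint
NOMAD_TOKEN = "example-acl-token"                        # assumed ACL token


def register_job(client_id: str, image: str) -> dict:
    """Submit a simple Docker-based service job for one client's tooling."""
    job = {
        "Job": {
            "ID": f"deploy-runner-{client_id}",
            "Name": f"deploy-runner-{client_id}",
            "Type": "service",
            "Datacenters": ["dc1"],
            "TaskGroups": [{
                "Name": "runner",
                "Count": 1,
                "Tasks": [{
                    "Name": "terraform-runner",
                    "Driver": "docker",
                    "Config": {"image": image},
                    "Resources": {"CPU": 500, "MemoryMB": 512},
                }],
            }],
        }
    }
    resp = requests.post(
        f"{NOMAD_ADDR}/v1/jobs",
        json=job,
        headers={"X-Nomad-Token": NOMAD_TOKEN},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # contains the evaluation ID Nomad created


if __name__ == "__main__":
    print(register_job("acme-prod", "example.registry/terraform-runner:1.0"))
```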
Tobias Macey
0:34:04
And in terms of your experience of building out this automation and managing this platform, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
Josh Beemster
0:34:16
That's a tough one. I think the most interesting has been, in the time working with clients, just how deeply interested a lot of clients are in how we're deploying things, and how hands-on all of these teams want to be. That's been an interesting and challenging lesson in terms of having to defend your work constantly, I guess is the point I'm trying to get to. If you were developing a system that was internal only, you wouldn't expect so much attention on it; you wouldn't expect people to rip it apart in quite so many ways. So that's been a challenge in developing the stack, but it's definitely a good thing to have. I think if you onboard a client and they audit you every time, you can only get better, and we've seen that as we've evolved. It's been challenging to try to meet all these requirements and expectations, but that's actually worked out for the better, because we've got a much better system at the end of it than we otherwise would have.

I think the other difficulty is really just figuring out how to manage so many servers concurrently, how to monitor all of them, and how to ensure the uptime of all of them. We run into a lot of really challenging scenarios with what we've deployed. One big issue is actually with anything that has an ephemeral nature, and this is one big reason, and I think there's another question coming about what things we'd want to change in the stack, but batch processes and ephemeral processes are much more fragile than you could possibly imagine. In a situation where you're just one company running one ETL process per day, you probably won't run into these issues, but we often see giant cluster failures across Amazon, which is massively challenging, and as we've scaled out we've started to be impacted by those a lot more. For example, we might have the EMR API fail across us-east-1 for a couple of hours. Now, again, if you're running one ETL job, you go, okay, I've got one failure, one thing that I've got to go and recover. On our side, we've got our support team, which might have 40 or 50 failures that they suddenly have to clear and communicate out to clients, making sure that they understand what's happening: why is their data late, why is it not arriving in the data warehouse? So that's definitely a big challenge. I'm not sure if it's an unexpected challenge, but it's definitely an interesting one to figure out how we deal with, along with any of the scale problems really, as a smaller SRE team trying to figure out how we manage the infrastructure of so many clients, keep it secure, and make sure that none of them are costing too much. So there are a lot of interesting challenges there that we're looking to solve in part with actual Snowplow monitoring.
So we build a lot of monitoring on top of all of these systems to try to do some trend analysis, and we're hoping to get a lot more time to look at solving these challenges with machine learning: figuring out how we can detect trends in the data and how we can scale up systems intelligently ahead of anything happening, so we can provide the best possible service. It's a lot of scale problems.
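To make the failure-triage problem concrete, here is a rough sketch of the kind of sweep a support team might run after a regional EMR incident: list the clusters that terminated with errors and the steps that failed, so every affected pipeline can be identified quickly. This is an illustrative boto3 example, not Snowplow's internal tooling; the lookback window and the assumption that one cluster maps to one client's ETL run are mine.

```python
# Hedged sketch: after an EMR outage, sweep the account for ETL clusters that
# terminated with errors and collect their failed steps for triage.
from datetime import datetime, timedelta

import boto3

emr = boto3.client("emr", region_name="us-east-1")


def failed_etl_runs(hours: int = 6) -> list:
    since = datetime.utcnow() - timedelta(hours=hours)
    failures = []
    paginator = emr.get_paginator("list_clusters")
    for page in paginator.paginate(
        CreatedAfter=since, ClusterStates=["TERMINATED_WITH_ERRORS"]
    ):
        for cluster in page["Clusters"]:
            steps = emr.list_steps(
                ClusterId=cluster["Id"], StepStates=["FAILED"]
            )["Steps"]
            failures.append({
                "cluster_id": cluster["Id"],
                "cluster_name": cluster["Name"],  # e.g. which pipeline/client
                "failed_steps": [s["Name"] for s in steps],
            })
    return failures


if __name__ == "__main__":
    for run in failed_etl_runs():
        print(run["cluster_name"], run["failed_steps"])
```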
Tobias Macey
0:37:59
And are there any elements of your experience of managing the Snowplow platform that are more broadly applicable to data infrastructure as a whole that you think are worth calling out?
Josh Beemster
0:38:12
I think the biggest one is probably just, as a general rule, that infrastructure has to be super composable. It has to be as granular as you can make it. If you want to be able to evolve it quickly, change things quickly, and attach lots of different pieces together, start very composable. Don't big-bang an infrastructure stack; that will catch you out very quickly. I think that's probably the biggest learning I've had, along with how you scope your resources: making sure you understand which resources are regional constructs and which are global constructs, how you group things, and how you manage your different infrastructure resources. In terms of just managing data pipelines, getting a clear understanding of what you're tracking, why you're tracking it, and how much data you're expecting or wanting to collect is incredibly important. Make sure that you design it for purpose. Going too generic with what you're building doesn't really help; really think about what the end goal is. For example, within our own internal pipelines we've got an ops or monitoring pipeline, which is really streamlined for dealing with lots and lots of small metrics, and we've got our business pipeline, which is more geared towards our longer-term analytics of our website and those kinds of things. So figuring out what structure you need to serve the business is also really important.
Tobias Macey
0:40:04
And if you were to start over today with all of Snowplow and the infrastructure automation that you're using for it, what are some of the things that you would do differently, or ways that you would change some of the evolution of either the Snowplow pipeline itself or the way that you've approached the infrastructure management?
Josh Beemster
0:40:22
So I won't speak to the overall Snowplow components, but on the infrastructure side of it, I think the biggest thing is that we'd probably not go with something like Kinesis, or we would have approached it very differently. I think we would have leveraged something that is truly auto-scaling. That's probably one of the biggest infrastructure problems we have at the moment: Kinesis and its lack of elasticity. It does cause us a lot of heartache. Beyond that, in terms of managing the infrastructure, where we are now is pretty close to where I wanted to end up. As I mentioned, we did rebuild the entire system last year, and that was with a lot of years of learning about how we could do things better and differently.

I think if I could go back to the start of last year and take a slightly different tack, I would go even more granular in terms of the stack separation and the topology of the infrastructure that we've built. We've still gone too big bang, which is making it quite hard to evolve certain parts of the infrastructure stack that we're trying to manage. We've coupled components together that should have been split apart, we've made it quite difficult for ourselves, and we're going to be paying for that in the near future: how do we unpick some of these associations, and how can we make it more flexible again? That's really a problem we suffer from just because we have so much variance in what our clients want. There isn't one pipeline; these pipelines have lots of different pluggable components. There are needs for maybe multi-region support, multi-cloud support for sinking data into a single pipeline, or really custom ETL systems, and the more granular we are, the easier all of that would have been. Not that we've coded ourselves into a hole, but we do have some work to make it that much more flexible again. That's really, I guess, a learning as we go: we just need to be super flexible, because every client is going to have different wants and different needs, and we want to be able to serve them all, but in a manageable way. And that is difficult if you've only got one version of what you can set up.
Tobias Macey
0:42:57
And what's in store for the future of the Snowplow product, and the way that you're approaching management of the service that you're providing for it?
Josh Beemster
0:43:07
So, hopefully coming soon, and I hope the product team don't get upset at me for saying any of these things. What I'm hoping for quite soon is that from our managed service UI you'll be able to start managing infrastructure directly. At the moment, the way customers interact with it is that they have a lot of transparency when they log into the sub-account or into the GCP project and go and look at things, but they don't have, I guess, a lot of visibility over what is configurable and what all the different options for their pipeline are; that's still somewhat hidden away. So hopefully coming soon we'll have a lot of that surfaced into the UI, almost like ETL-builder style tooling. If you are a company looking to get a robust, highly available data pipeline, you'd be able to click and drag components: set up a Snowplow collector in GCP that streams data into a Kafka cluster in AWS, and then set up a second collector in AWS, so you have that high availability not just across regions or across availability zones, but across clouds. That's really where I see the infrastructure going, and where it will need to go. For really robust, highly available setups, it's not that banking on one cloud is necessarily risky, but there's definitely a requirement, and there will be more requirements, to have things much more split and much more spread, so that you can really have that hundred percent uptime, taking all kinds of worries away: if Amazon has a glitch, or GCP has a glitch, that's not going to suddenly stop your pipeline from working, it will still keep carrying on. It's just all that extra failover.

The other big change, and I sort of touched on this when we were talking about batch processes and ephemeral EMR processes, is that a lot of that, if not all of it, will be moving to streaming architectures quite soon. Where currently there are still some batch processes left, Snowplow will be moving to essentially 100% streaming, which we're very hopeful will result in not only better cost efficiency, especially at scale, but much higher robustness and stability of the platform in general. There are a lot of difficulties with batch that we struggle to overcome. Working out what my cluster specification should be between spikes is quite difficult; you can't auto-scale a batch process in the same way that you can auto-scale a streaming process. You end up dealing with the overhead of, I've had a spike in traffic and I just need to grab a few extra servers, or, I've had a spike in traffic for a really long time and I now need an enormous cluster to get through this backlog in a reasonable amount of time. Those kinds of problems we're hoping will just go away with the migration to a fully streaming architecture. It is, on the flip side, definitely more expensive to run at the lower scale, but I guess we're not really building for the low scale: it is a big data architecture, these are big data pipelines, and streaming definitely has a lot of cost and performance efficiencies at scale. And that's, I think, where we want to take Snowplow next.
Tobias Macey
0:46:41
And are there any aspects of the Snowplow product, or your management of the infrastructure and the services you provide for it, that we didn't discuss yet that you'd like to cover before we close out the show?
Josh Beemster
0:46:51
I think that's probably everything.
Tobias Macey
0:46:55
All right. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Josh Beemster
0:47:12
I can't see a big gap, actually, in the tooling. There's so much amazing tooling now for managing this stuff. Maybe the biggest gap is some way to harmonize a lot of the tools that are available, which a lot of cloud providers are starting to do. But there are still so many options in managing data that it makes it hard to know which way you should go. Should you put everything in S3 as Parquet? Should you put everything in Redshift? Should you put everything in Snowflake DB? Should you put everything kind of everywhere for different business use cases? I think there's probably no silver bullet, nothing that solves all of these problems, and there might never be, but there is just such a wealth of options that it's quite hard to pick any one. I don't know if that's a good gap, or if that's just a factor of too much choice.
Tobias Macey
0:48:10
It's definitely something that continues to be a problem, the paradox of choice, particularly as we add new platforms and new capabilities to the overall landscape of data management.
Josh Beemster
0:48:22
But it's a lot better than it used to be, that's for sure.
Tobias Macey
0:48:28
Certainly. All right, well, thank you very much for taking the time today to join me and discuss the work that you're doing at Snowplow, and how you're managing the infrastructure and automation around that for all of your different customers and the private SaaS nature of your business. It's definitely an interesting area, and something that doesn't really get a lot of attention as far as how to manage the underlying infrastructure for these data products. So thank you for all of your time and effort on that front, and I hope you enjoy the rest of your day. Thanks, guys. Cheers. Bye.
0:49:03
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.