Shining A Light on Shadow IT In Data And Analytics - Episode 121

Summary

Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed services makes this a viable option, but can lead to downstream challenges. In this episode Sean Knapp and Charlie Crocker share their experiences of working in and with companies that have dealt with shadow IT projects and the importance of enabling and empowering the use and exploration of data and analytics. If you have ever been frustrated by seemingly draconian policies or struggled to align everyone on your supported platform, then this episode will help you gain some perspective and set you on a path to productive collaboration.

Many data engineers say the most frustrating part of their job is spending too much time maintaining and monitoring their data pipeline. Snowplow works with data-informed businesses to set up a real-time event data pipeline, taking care of installation, upgrades, autoscaling, and ongoing maintenance so you can focus on the data.

Snowplow runs in your own cloud account giving you complete control and flexibility over how your data is collected and processed. Best of all, Snowplow is built on top of open source technology which means you have visibility into every stage of your pipeline, with zero vendor lock in.

At Snowplow, we know how important it is for data engineers to deliver high-quality data across the organization. That’s why the Snowplow pipeline is designed to deliver complete, rich and accurate data into your data warehouse of choice. Your data analysts define the data structure that works best for your teams, and we enforce it end-to-end so your data is ready to use.

Get in touch with our team to find out how Snowplow can accelerate your analytics. Go to dataengineeringpodcast.com/snowplow. Set up a demo and mention you’re a listener for a special offer!


Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Sean Knapp and Charlie Crocker about shadow IT in data and analytics

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by sharing your definition of shadow IT?
  • What are some of the reasons that members of an organization might start building their own solutions outside of what is supported by the engineering teams?
    • What are some of the roles in an organization that you have seen involved in these shadow IT projects?
  • What kinds of tools or platforms are well suited for being provisioned and managed without involvement from the platform team?
    • What are some of the pitfalls that these solutions present as a result of their initial ease of use?
  • What are the benefits to the organization of individuals or teams building and managing their own solutions?
  • What are some of the risks associated with these implementations of data collection, storage, management, or analysis that have no oversight from the teams typically tasked with managing those systems?
    • What are some of the ways that compliance or data quality issues can arise from these projects?
  • Once a project has been started outside of the approved channels it can quickly take on a life of its own. What are some of the ways you have identified the presence of "unauthorized" data projects?
    • Once you have identified the existence of such a project how can you revise their implementation to integrate them with the "approved" platform that the organization supports?
  • What are some strategies for removing the friction in the collection, access, or availability of data in an organization that can eliminate the need for shadow IT implementations?
  • What are some of the inherent complexities in data management which you would like to see resolved in order to reduce the tensions that lead to these bespoke solutions?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate, and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake, and real-time data streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow, set up a demo, and mention you're a listener for a special offer. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include Strata Data in San Jose and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Sean Knapp and Charlie Crocker about shadow IT in the data and analytics space. So Sean, can you start by introducing yourself?
Sean Knapp
0:02:12
Yeah, absolutely. I'm Sean Knapp. I am the founder and CEO of Ascend.io.
Tobias Macey
0:02:19
And Charlie, can you introduce yourself as well?
0:02:29
And going back to you, Sean, do you remember how you first got involved in the area of data management?
Sean Knapp
0:02:34
I do. It was 15 years ago now. I had just joined Google as a front-end software engineer on web search, and we were obviously known for doing a lot of experimentation on the user experience, from the background color of ads to the layout of the page. And I quickly found that even the smallest experiment, where you move around just a few pixels on the front end, oftentimes brought with it hours and hours of data analysis, writing MapReduce jobs in the internal language there, called Sawzall, to analyze the usage of hundreds of millions, if not billions, of users. So really quickly and early on in my career, fresh out of college, I found myself doing a lot of pretty complex big data jobs just to answer questions around the consumer experience on web search.
Tobias Macey
0:03:28
And Charlie, do you remember how you first got involved in data management?
Charlie Crocker
0:03:33
I was an environmental consultant many years ago, 25 years or so ago, and in the organizations that we worked in, everybody was using paper and sometimes, maybe if we were lucky, spreadsheets. And so we were asked to make charts, graphs, maps, all sorts of information, and as a junior staff member you were expected to just sit in your cube and draw these things out. So I found a way to work directly with the analytics lab, get them to start sending us the data in electronic format, and I brought this interesting concept into the environmental consulting world called a database. And that led to building out databases, maps, and then moving into larger and larger scale data sets as my career progressed.
Tobias Macey
0:04:17
And so, as I mentioned at the open, we're talking about the idea of shadow IT and how it manifests in the space of data management and data analytics. So before we get too far along, can each of you start by sharing your definition and how you think about the term shadow IT?
Sean Knapp
0:04:38
Sure, happy to go first. Oftentimes when I think about the notion of shadow IT, it's really what we see happening from consumers and customers of IT. Oftentimes they're looking to test out or experiment or try new technologies to meet some of their needs, and the demand that's usually behind it is a search for technologies and capabilities beyond what's been brought to the broader organization. And oftentimes the different customers of IT are looking to go faster, or get more experimental, or simply self-serve and offload that burden from IT. And so you'll start to see those business units take on more of that responsibility on their own and bring more of IT in house, if you will, into their own organization.
Charlie Crocker
0:05:31
Yeah, and I think the term shadow IT is interesting, and it kind of has, I think, a little bit of a negative connotation. I really see shadow IT as being the environment where people choose to work in silos, or sometimes in business units. This has really accelerated a lot in the last probably five to ten years with the advent of all of the cloud service providers, because it's easier and easier for people to do their own IT without having a full IT organization. So you're starting to see a lot of that experimentation and a lot of that acceleration without necessarily having a central organization to drive standards, etc.
Tobias Macey
0:06:18
One of the things that you were both pointing out is this pressure to be able to deliver, and how the availability of some of these different services that are easier to access and easier to provision, without necessarily needing to fill out some sort of a purchase order or rack a number of servers, has increased the viability of shadow IT as a strategy. And so I'm wondering, what are some of the main drivers of that tension and of those types of projects that lean on these self-provisioned services, and that contribute to projects that are maybe not driven by the engineering or IT organizations within a business?
Charlie Crocker
0:07:09
To me, in most cases, it's about acceleration, or it's about fiefdoms. It's about trust. So you see, in organizations that have a lot of M&A, people coming in with their own way of doing business and finding it hard to adapt and move into an existing organization. You find existing organizations where there's, in some cases, a lack of trust between the engineers that build the products and the IT organization that may be tasked with managing the services and the contracts with the big vendors. And then you've got the conflict between the goal of getting a new product out really fast and leveraging standards and using standard operating procedures.
Sean Knapp
0:07:56
Yeah, I would double down on what Charlie's saying there, which is that the drivers are conflicting goals. Oftentimes IT exists to help centralize, standardize, and get economies of scale and leverage, which by design means they should move slower and more thoughtfully, more carefully. And at the same time, that is in conflict with the needs of the business, which is to move very quickly and at times even to break some glass in the pursuit of moving quickly in response to business demands. And I think the reason this exists, if we pop back up even higher, is because we're seeing across industries that the wave of digital transformation is pushing businesses to move faster, and they have to respond more quickly. For example, in software development the introduction of DevOps was, in my view, all about how do we enable more people to build more software applications faster, yet more safely, and we do that through a variety of different constructs. Yet that same level of agility that many companies have appreciated in software for a while, we haven't yet received in the data world; we're still doing much more waterfall, traditional-style approaches, and as a result we see those pressures, and that's what's causing and igniting a lot of this behavior.
Charlie Crocker
0:09:24
And there's a little bit of danger in that behavior too, with the data. With the software piece I'm not as deeply engaged, but with the data piece you end up with siloed data sets, siloed data pipelines, repeated data pipelines. You increase expense, but you also have a whole different set of privacy and management schemes that need to be dealt with. So with data it gets really hard, in a large organization with a lot of different silos and a lot of different processes, to really even understand compliance regimes, for example. I totally agree, totally agree.
Tobias Macey
0:10:03
One of the other things that's interesting about that difference between software and data projects is the level of impact that can be had throughout the business by a project being delivered in an accelerated fashion, as well as some of the issues around things like compliance or data quality that arise and are somewhat unique to the analytics space. And I'm wondering what you have seen as some of the challenges posed by, and some of the driving forces toward, building these projects that are maybe outside of the supported platform within an organization.
Sean Knapp
0:10:39
Yeah, I think it's interesting when we think about how we've been trying to centralize and standardize in many ways, and I would make a potentially inflammatory comment, which is: I'm not sure it really matters as much these days whether we standardize on, say, are you going to process your data with Hadoop or Spark, for example. I'm not really sure those levels of standardization matter as much as, do we have standardization on how we articulate what data exists? How do we unify how we know that it got there, and why it got there, and then where does it go? And so it's much more the notion of governance and lineage tracking, a lot more at the metadata layer. Because, as Charlie is highlighting, that's the stuff that gets scary: do I know if your stuff is actually legit and valid, and if it is broken, or does it have a bug, is it legal
Charlie Crocker
0:11:40
and even
Unknown
0:11:43
illegal? Yes.
Sean Knapp
0:11:47
All of a sudden, it's like in the software world, where you would make API calls, right? And so if somebody fixed the bug, the next time you made that API call you'd probably get the right answer. Yet in the data world, you're making copies of data and moving them all over, and so you may have produced a new data set from a buggy piece of code. But how do you know whether it even went to the right place, or, as Charlie highlights, whether that data should have gone there at all? And gosh, if it went somewhere it wasn't supposed to, do you at least know that it went there, and how to retract it, and so on? And so the problem, I would argue, is more of a metadata layer these days, of just knowing what is happening with your data.
Tobias Macey
0:12:26
As you mentioned, the tracking of the data is definitely one of the key problems that exists in this particular realm. Because, as you said, with software, if there's a bug, you fix it and it gets redeployed. But with data, if it gets copied to 5, 10, 15, 1,000 different places, and then you realize, oh, there was one different way that we were tracking it, or there's a particular field that needs to be masked, how do you then go and apply that transformation or apply that constraint on all of the different copies of the data, when you don't even know where they all are anymore?
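To make that concrete, here is a minimal sketch of the kind of lineage record the guests are describing, assuming a simple in-memory registry rather than any particular product's API; the dataset names and commit hashes below are purely illustrative. Each materialized dataset records the code version and inputs that produced it, so when a buggy commit or a field that needs masking is discovered, the downstream copies can at least be enumerated.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetVersion:
    """One materialized dataset, tagged with the code and inputs that produced it."""
    name: str
    code_commit: str                              # version of the pipeline code that ran
    inputs: list = field(default_factory=list)    # names of upstream datasets it was built from


# Hypothetical registry of everything the platform has materialized.
REGISTRY = {
    "events_raw":    DatasetVersion("events_raw", "a1b2c3"),
    "events_clean":  DatasetVersion("events_clean", "d4e5f6", ["events_raw"]),
    "weekly_report": DatasetVersion("weekly_report", "d4e5f6", ["events_clean"]),
}


def downstream_of(dataset: str) -> set:
    """Every dataset that directly or transitively consumed `dataset`.

    This is the question behind "a buggy commit wrote events_raw;
    which copies need to be rebuilt, masked, or retracted?"
    """
    hits = set()
    frontier = {dataset}
    while frontier:
        current = frontier.pop()
        for ds in REGISTRY.values():
            if current in ds.inputs and ds.name not in hits:
                hits.add(ds.name)
                frontier.add(ds.name)
    return hits


print(downstream_of("events_raw"))   # -> {'events_clean', 'weekly_report'} (order may vary)
```

A real platform would persist these records and capture them automatically as pipelines run, but the data structure is the point: without it, the answer to "where did that data go?" lives only in people's heads.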
Charlie Crocker
0:12:57
Yeah, or, I mean, one of the biggest issues is, which metric is the right metric? Ten people can run the same pipeline and call the outcome the same number, and you could have ten different numbers, right? And so, at least early in the transition for a lot of companies, he who owns the metric owns the story, and so every individual would want to come into a C-staff meeting with their own set of metrics, for example. So how do you, from the top down, start saying, look, how do we drive standardization without squelching innovation? And so the stuff that Sean's talking about around metadata, around being able to have visibility into the pipelines, being able to rank and canonize certain data sets and certain metrics, those are the key things that allow success in a data product or in a data pipeline within an organization.
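One concrete pattern for the "ten pipelines, ten numbers" problem is to publish a single canonical definition of each metric that every pipeline and dashboard imports instead of re-deriving. This is only a sketch of that idea, not anything the guests specifically describe; the metric, the column names, and the module are all invented for illustration.

```python
# metrics.py -- one shared, reviewable definition of "weekly active users".
import pandas as pd


def weekly_active_users(events: pd.DataFrame) -> pd.Series:
    """Canonical WAU: distinct user_ids per ISO week, counting only 'active' events.

    Every team imports this function, so changing the definition is one
    reviewable change rather than ten slightly different notebooks.
    """
    active = events[events["event_type"] == "active"]
    week = active["timestamp"].dt.isocalendar().week
    return active.groupby(week)["user_id"].nunique()
```

The canonized metric then becomes something the metadata layer can point at: dashboards that use it can be ranked as trustworthy, and ones that roll their own can be flagged.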
Sean Knapp
0:14:03
Yeah, it's really interesting, kind of building on top of what Charlie was saying. We've watched a number of companies go down a super cool path, which is, they've said, look, we hate all these little fiefdoms that are holding tight onto their numbers and literally won't share them, or their data sets, with other people. You have large Fortune 500 companies, and each BU has some of the data, but they don't want to share it with others. And we've seen really cool executive-level mandates that say, you know what, we're going to expose all of our data, all of our derived data sets, our unpublished data sets, and if we have disagreements around how to calculate something, or what the definition is, we're going to have that conversation. But we're going to create a higher level of transparency. And the classic objection to that has always been, but now we're giving more people access to more data, isn't that scary? And the response that I think is really helpful is: not if we've actually invested in that metadata layer and have enough intelligence to say, there are some things you shouldn't have access to, and there are some things you can't take to other places. If you can automate more of that, now you can safely enable this level of dialogue and collaboration across teams that you just otherwise couldn't have.
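As a rough illustration of what "automating that at the metadata layer" could look like, here is a toy policy check driven by dataset tags rather than by who happens to hold a copy. This is not any real product's API; the datasets, tags, and roles are invented for the example.

```python
# Toy metadata-driven access check: the tags, not the gatekeepers, decide.
DATASET_TAGS = {
    "orders":          {"pii": False, "exportable": True},
    "customer_emails": {"pii": True,  "exportable": False},
}

ROLE_GRANTS = {
    "analyst": {"pii": False},   # analysts may read only non-PII datasets
    "steward": {"pii": True},    # data stewards may also read PII
}


def can_read(role: str, dataset: str) -> bool:
    """'Some things you shouldn't have access to.'"""
    return ROLE_GRANTS[role]["pii"] or not DATASET_TAGS[dataset]["pii"]


def can_export(role: str, dataset: str) -> bool:
    """'Some things you can't take to other places.'"""
    return can_read(role, dataset) and DATASET_TAGS[dataset]["exportable"]


print(can_read("analyst", "orders"))             # True
print(can_read("analyst", "customer_emails"))    # False
print(can_export("steward", "customer_emails"))  # False
```

The point of the sketch is that once tags like these exist and are trusted, opening up the catalog stops being scary, because the policy travels with the metadata instead of with individual gatekeepers.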
Charlie Crocker
0:15:19
Yeah, governance, governance, governance, right? And how do you do that without, once again, restricting innovation? I've seen this with many organizations that I've worked with: governance is the thing you do at the very end, when you're finally done with everything you've been working on and you've worked with all the data; then you look at it and say, wow, am I in compliance? Did we do this the right way? That kind of thing. And shoehorning things into governance at the end may get you there faster at first, but it slows you down when you start to really try to scale.
Tobias Macey
0:15:50
And another thing that I think is worth calling out, related to what you were saying, Sean, about the fear of giving access to data to all the different people in an organization, is that they might not necessarily have the appropriate background or understanding of how to interpret that data or how to use it for making effective decisions. And so I think in addition to the governance and metadata aspects, there's also the education component of making sure that everyone within the organization is able to actually gain value from the data that they have access to. In that scope as well, I'm wondering what you have both seen in terms of the types of roles or responsibilities that often drive the shadow IT projects, and some of the reasons those roles might be more likely to build out some new platform, or a new transformation on the data they have access to, or maybe collect new sources of data from other systems that aren't already incorporated into the underlying platform.
Sean Knapp
0:17:01
Yes, really good question. What we see usually takes a few forms; it depends on how big the company is and where they are in their data journey, if you will. One of the more obvious ones stems from the engineering and product teams: the data engineering role, who is constructing a lot of data pipelines and trying to source new data sets. Oftentimes they're part of a data analytics or a data science team, and they're connecting essentially all these various data systems, trying to get access to new pieces of data and work with new technologies to empower the broader group, and they often just can't get those capabilities or can't get those data sets onboarded as part of the standard corporate platform. We see them emerge a lot as some of the early drivers. We also, interestingly, see product managers, and even data scientists themselves, saying, look, I know we have all this really big infrastructure and these really cool capabilities in a central platform, but I'm not a Spark expert; I just want the power of Spark to run a big job. Or, I don't want to deal with all these other complex technologies around it; I just want to point it at some data, write some really cool data logic, create something that's more automated, and move on. And so oftentimes they become the seed for some of these shadow IT efforts and start to trigger some of this behavior of, hey, we can really move a lot faster as a result if we can properly free ourselves and get moving quicker.
Charlie Crocker
0:18:49
Yeah. You can kind of see this sort of federated to centralized to federated thinking. I've seen several organizations where they had a very fractured, fragmented data structure, data silos, etc., and then they worked very hard to say, look, we can come up with a consistent methodology for how we store the data and where it's located. Maybe we have flexibility on the tools, maybe we have flexibility on some of the compute layers, but we're going to have only a single data store, or a single place to put that, and we're going to have a supported stack, and this kind of thing. And that's all well and good, but shadow IT in some cases becomes the residual: the people that are really uncomfortable making the change, right? So making the change from, I like command-line Hadoop, to, now we're going to start using managed services like Glue in the cloud, that's a whole different skill set. And so in some cases people are sort of clinging to the old tech because that's what they know, and the switching cost for the data engineers is almost too high. And so they'll look to try to keep a fiefdom or a silo that continues to work the way they understand it.
Tobias Macey
0:20:18
And another interesting aspect of this is that, as we mentioned before, the term shadow IT can have this negative connotation, and it can lead to people trying to hide their activities from central IT, or just from the organization at large, so that they don't get called out for embarking on some unapproved project or incorporating some technology that hasn't been vetted by the powers that be. And so I'm wondering, what are some of the ways that we can try to eliminate that stigma, so that people are more willing to be upfront about the fact that, hey, I tried this thing and it's having this useful outcome, and then be able to incorporate that into the rest of the organization, or popularize it, or add a way for them to integrate the work that they've been doing into the data sources or data processing systems that are being used throughout the organization?
Charlie Crocker
0:21:14
I'll just quickly offer a short statement on that: the service organization, whatever that central organization is, needs to really think of its users as customers. And if you can't provide them with something of value that allows them to innovate and scale, they're going to go somewhere else, right? So you find a lot of organizations where the central unit, whether you call it IT or the central analytics team, becomes more of a policing organization than an innovation organization. So how do you take that customer-first attitude and bring it to your community from that central location?
Sean Knapp
0:21:18
Yeah, I would double down on that with Charlie. We see this happen pretty frequently, where that starts to be an unhealthy organizational dynamic. And to put it really directly, it's a pretty terminal strategy, because at some point customers will go elsewhere, even if they're internal. We're in this stage right now where the central teams and the service provider teams have a lot of leverage because of concern around data privacy and governance and data leakage and so on, but the encouragement I would generally provide is to not misuse and abuse that, because at some point markets, even the small internal markets, will correct themselves inside of the organization. And so the way to think about this, especially for those who are testing the waters in shadow IT and trying new technologies and so on: one of the pieces of guidance we always provide is, make sure you don't paint yourself into a corner. For any technology that you are trying, is it still enterprise grade enough that it could actually be adopted by the broader organization? Does it have the right security and governance capabilities? Does it have the ability to integrate into your broader ecosystem in some way? You want to make sure that you're not trying to introduce a technology that fundamentally traps you, because that's a surefire way of getting a lot of resistance from IT. And this is certainly what we've found with a lot of our customers: as they experiment and explore with technologies, whether it's ours or others, they find some really cool use cases that prove out a lot of the value, but then we still help them come back in and talk to the central teams and say, hey, look at all these security safeguards, look at how we can do this as, as Charlie was describing, a hub-and-spoke model of data sharing and data governance, where we each have our little pods of data sets but we can publish back to central teams and have proper governance on this. That way they can actually become a really cool advocate for how to introduce new technologies back into the broader organization so that everybody benefits. I think if you think through it with that mindset, it's a really collaborative approach for how both the business units and the central teams can work together well.
Tobias Macey
0:24:29
Yeah, there's an analogy that I've heard in a couple of different contexts of the pioneers, settlers, and town planners, in terms of the lifecycle of innovation from a technology perspective. The people who are embarking on these different projects of bringing in different stacks and different tools are the pioneers, going out there and exploring what's available. And then, if they find something that's useful, the settlers on the team will be the ones who adapt it and say, okay, we've got this maybe bleeding-edge tool or technology, how do we actually make use of it in a somewhat more stable fashion? And then eventually that gets handed off to the town planners, who are maybe the central service organization, who determine how to incorporate it into the rest of the organization and the rest of their technology stack, and how to make it scale and make it available.
Sean Knapp
0:25:22
Yeah, I really like that.
Charlie Crocker
0:25:24
I like it too. You end up, though, in some cases, with a really nice town that's been built out, but there's somebody who's not going to the planning department and is building a little hazardous waste dump somewhere. And at some point somebody's going to find that hazardous waste dump and they're gonna have to deal with it. Or you end up with Boston, where it's a nice city, but
Tobias Macey
0:25:45
you can't find your way anywhere.
Charlie Crocker
0:25:49
Or you can't afford to live anywhere there. Right. So, yes,
Sean Knapp
0:25:53
Yeah, I think there's something really cool about that approach, though. Okay, so it may be problematic, but if you can actually keep it well contained, then, to continue your analogy, it's kind of out on the frontier as opposed to actually in the town. And a good process of collaboration between central and the business teams is to be really careful what you allow to be introduced back in. You don't want to introduce technologies that could infect the broader town architecture, if you will, while still giving those teams some degree of freedom and flexibility.
Tobias Macey
0:26:32
And so one of the interesting things to explore as well is that there are these tensions between the priorities of the different groups within the business and the different projects that get spawned as a result. But once you have identified, or somebody has introduced, some new tool and presented it to the rest of the organization, what are some of the useful strategies for removing the friction that exists in the organization that causes people to go out and build those new tools in the first place, or maybe try to hide them? And how do you incorporate those new platforms into the organization and make it easy to integrate or extend the services that are available, so that you might use a different compute framework, but you're not trying to reinvent the definition of a particular metric, and you're able to rely on some of the master data management or compliance and governance strategies that exist without them being too rigid?
Sean Knapp
0:27:35
Yeah, playing off of Charlie's comments earlier: if you're a central enablement team, really, at the end of the day, your customers are the other lines of business, and there are really great lessons to be learned from software vendors and enterprise SaaS vendors in general. It's not just about creating a technology, writing a white paper on it, and emailing it out to the rest of the audience, only to be ignored; it's really around, how do we actually help onboard different teams? And so we've seen, and even helped, our customers do a variety of different things, from, how do you create centralized centers of excellence that drive distributed innovation? So create a team that is literally a SWAT team that goes from org to org and essentially drops in and says, what is, in our world, your biggest, hardest, most painful data pipeline, and how do we help you migrate it to a really cool new tech stack fast and efficiently, and we'll train you in the process? Approaches like that are great for the central team too, because you get to drop into a lot of really cool businesses, see end use cases you may not otherwise get to see, and help them build stuff out fast, which is super fun. The other thing we've seen is companies literally doing the big once-a-quarter or once-a-year hackathons and weaving in themes. And so that becomes a way for them to say, hey, we've just brought in three new, really powerful technologies, we're going to focus the hackathon on the use of these new technologies, and we're going to bring in those vendors, or have our centralized SWAT teams be these orbiting teams, to really help everybody massively ramp up within 24 hours on how to build something really cool and magical. We've seen both of these strategies be really effective.
Tobias Macey
0:29:23
And then, shifting gears a bit: we mentioned at the outset that some of the reason that shadow IT projects, particularly in the data and analytics space, are starting to become a bit more prevalent is the availability of these different cloud tools, or one-click provisioned applications, or easy-to-use databases. So I'm wondering what types of tools or platforms in particular are well suited for being provisioned by people who don't necessarily work in a primarily engineering role, or by people who are not necessarily looking for an end-to-end integrated solution and just want something that they can start using in conjunction with existing tools. And what are some of the potential pitfalls that exist as a result of these tools being so easy to use, when the people who are initially setting them up may not have the context or training necessary to foresee some of those potential problems?
Charlie Crocker
0:30:24
I'll start with what we found, which was, in many cases, you have a set of master data, so people have access to business systems data or some set of product usage data. And then it's very easy for people to go out and get analytics tools, right? Go out and get your Power BI, go out and get your Tableau, go out and download some of that data into Excel. And then they start to do really great work; they think about it and they generate a lot of really cool metrics, great reports, etc. But once again, because those tools are easy to get and very well marketed into organizations, you start getting lots and lots of little silos; everybody's desktop becomes a silo. And so that BI layer on top, because of the ease of access to the data and the ease of sharing it, with Excel files being moved around, becomes a huge governance headache, and it also makes it really difficult to do any provenance and to understand the value of the actual information that's coming out of that effort.
Sean Knapp
0:31:42
Yeah, I agree. And even building on top of that, as Charlie's describing, a lot of SaaS vendors and BI vendors do a really good job of nailing that: how do you make it pay as you go? How do you make things that are super simple to connect into your existing ecosystem, and so on? Which is great to start. I think part of the balance is finding those tools and technologies that, of course, have that easy capability to get up and going, but, similar to what I was highlighting before, are still enterprise grade enough that at the same time they're extensible, can be hooked into the rest of your ecosystem, and are sufficient for far more advanced use cases. And I think that's the nuance of figuring out where you find those and how well they work, so you don't end up with this massive proliferation of silos. And at least SaaS does a pretty good job of helping to solve parts of that problem, because it's much easier to unify access and take what started as just a couple of use cases and expand it to a bigger organization without everybody sitting in their own little pockets.
Charlie Crocker
0:32:52
Yeah, you brought up the education piece earlier, Tobias. A really supportive, very customer-focused central organization, well-supported BI tools and data engineering tools and data science tools: if you can get that ecosystem built out and that support network built out, and you can help the people within that organization feel like they are the champions and that they have some flexibility, then you can start to drive a lot of those standards. A lot of shadow IT, like I said earlier, is trust and a lack of communication. And so how do you deliver not just a tool that is easy to use, but a support team and an organization that helps people feel like they're always two steps ahead, rather than being told no at every turn?
Tobias Macey
0:33:48
And for that communication and support piece, do you think it's just a matter of saying, hey, we're here and we're available, ask us questions? Is it a matter of having substantial documentation for people at different levels to be able to access and follow? Is it a matter of publishing the availability of different data sets and making them discoverable, using maybe something like the Amundsen tool from Lyft? Are there any other elements of that support strategy that you've seen as being effective and productive?
Charlie Crocker
0:34:20
The big thing is you need top-down support. If you don't have top-down support for driving some of those standards, then that behavior will never change, no matter how well you do at the rest. If you get top-down support, and you have leadership at the next tier down that is held accountable for it, you can start to drive that. There are lots of different tools; data discovery is one of the hardest. There are a lot of tools out there, and I don't even know what the newest ones are right now, but data cataloging and finding what is pertinent, what is the most valuable data, and how we easily do that in a big organization that's even got standards, that's a huge thing. There's a lot of documentation, there's a lot of learning, there are great tools for doing data discovery, but you really, really need the top-down support for driving that consistency, and clear reasons and goals for why we're doing this. We're doing this because we want to be a compliant company. We're doing this because we are stewards of our customers' data. We're doing this because we feel like we can drive innovation and scale better as a company. We're doing this so we don't have redundant people doing redundant jobs all over the place. There are a lot of reasons why you drive people toward that, and if you're not clear on why, people are not really going to understand.
Sean Knapp
0:35:53
Yeah, I'd totally double down on that. Once you have more top-down support, which is really where you need executive, or at least very senior level, sponsorship, because a lot of the technical decisions will be made much deeper in the organization, what you get at a more senior level is the amplification of this importance. And those are also the sponsors you can get internally to then do these really cool things, like creating rotational teams to go spend time with the internal customers, or doing these hackathons. These are the things that allow you to make outsize-impact moves in really short timeframes and, as you decide on some of these new strategies, frankly help them see the light of day at a much larger scale much faster, which everybody's super interested in doing these days.
Tobias Macey
0:36:47
And are there any other inherent complexities in the overall space of data management and the available technologies that you think we need to see resolved and addressed more effectively in order to reduce the tension that exists between the organizations and the different business units that leads to these bespoke solutions?
Sean Knapp
0:37:07
You know, my take is, and Tobias, we've talked about this a little bit before, I'm a really big believer in moving a lot of the technology systems from imperative to declarative. And the primary reason is that we have pockets of technology in the data space right now: we have storage engines, we have processing engines, we have data catalogs and data warehouses and governance systems, but we don't have systems that actually connect all of these together. And Tobias, you talked earlier about the importance of metadata; most of the metadata management in most companies is being done pretty manually today. And even if I have automated systems, for example automated data replication or pipeline orchestration and so on, I usually automate something that says, do a job at a particular point in time. But we don't really have a tight linking between this piece of code ran on this piece of data, that produced this other piece of data, and here's why it did it. And because we don't have this tightly bound notion that's highly automated at the metadata layer, we push all of that burden onto people, and that's why we do it mostly manually. Very few companies these days can actually say, is the data in your warehouse, your database, your lake actually reflective of the code in your repository? It's probably reflective of some version of your code that ran at some point, on some version of the data, but you actually don't know how and why. You end up throwing an engineer at the problem: go ahead, go audit this stuff and figure out why. And so I think one of the things that's just missing today is smarter systems that look more holistically across the data and code landscape and can actually track and trace. In doing so, we can automate more and get more of that burden out of the hands of engineers who are trying to do it manually, which accelerates the cycles and makes it easier for people to move a little bit more freely.
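As a sketch of that imperative-to-declarative shift (this is not Ascend's actual API, just an assumed, illustrative interface), a declarative spec names the inputs, the transform, and the output, so the system itself can record which code produced which data instead of leaving that bookkeeping to people. All class, function, and dataset names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable
import hashlib
import inspect


@dataclass(frozen=True)
class DeclaredDataset:
    """A declarative spec: what this dataset is derived from, and how.

    Because the transform is part of the declaration, the platform can stamp
    every materialization with a fingerprint of the exact code that ran,
    instead of asking an engineer to reconstruct that after the fact.
    """
    name: str
    inputs: tuple
    transform: Callable

    def code_fingerprint(self) -> str:
        # Hash of the transform's source: "this piece of code ran on this data".
        source = inspect.getsource(self.transform)
        return hashlib.sha256(source.encode()).hexdigest()[:12]


def dedupe(rows: list) -> list:
    # The business logic the author actually cares about.
    return list(dict.fromkeys(rows))


events_clean = DeclaredDataset(
    name="events_clean",
    inputs=("events_raw",),
    transform=dedupe,
)

# The platform, not the author, records output, inputs, and code version.
print(events_clean.name, events_clean.inputs, events_clean.code_fingerprint())
```

In an imperative world the scheduler only knows "run this job at 2 a.m."; in a declarative one it knows what the job reads, what it writes, and which code did it, which is exactly the metadata needed to answer whether the warehouse still reflects the repository.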
Charlie Crocker
0:39:17
Yeah, I'm gonna double down on what you're saying there. And I know, Sean, you and I have had this conversation many times, but visibility into all of the work that is happening: all of the loads, all of the processing, the cost of the processing, which data sets are being used by hundreds of pipelines, which data sets we don't need anymore, streaming information, what data is scheduled to be retired and deleted. It's really difficult for a non-technical person who may be making some of these decisions, like product leaders working with their legal counterparts, to get that view. There are a lot of good tools out there that are starting to sniff and consume logs and place points and information along various parts of the pathway, but it is not a solved problem yet.
Tobias Macey
0:40:22
Are there any other aspects of this topic of shadow IT, the motivating factors, possible solutions, the reasons for the tension, and ways to try to overcome it, that we haven't discussed yet which you think we should cover before we close out the show?
Sean Knapp
0:40:39
I think, kind of going back to core first principles, and as mentioned a little bit before, DevOps is ultimately, at a very high level, all about how do we enable more people to do more with software, faster and safely, and fighting against anything like that is like fighting against gravity. The same thing happens when it comes to data. A lot of the conversations around DataOps are like that; it's still forming, there are a lot of opinions, and people are trying to push it in a bunch of different directions. But really, at the end of the day, it's going to come down to, how do we enable more people to do more things with data, faster and safely, and it's going to be the same core business drivers. And I think for a lot of teams, the specific nuances of how you accomplish that inside of your organization may be very specific, very contextual to your world, but trying to actually reroute that momentum is really hard, because it is very much like fighting against gravity. Figuring out how you can enable that, and, oftentimes, we do the same sort of exercise in how we build here at Ascend, which is: gosh, we want to enable this, and that would make the organization much better, and there's always another hard technical problem at the other end of that to solve. So we go solve that problem and figure out, well, if we can solve that, that will enable us to be a much more efficient and effective organization. And so we adhere to those core principles and keep trying to knock down the next challenge, to really help the team go faster and faster, safely.
Charlie Crocker
0:42:28
Yeah, and I'll come back to that. I think that in many cases the hardest problem is not a technical problem, it's a people problem. So you've got to figure out how to motivate people, how to put the customer first; you've got to help people understand why there's value in working together, and you've got to support them.
Tobias Macey
0:42:47
Well, for anybody who wants to follow along with the work that you're both doing or get in touch, I'll have you each add your preferred contact information to the show notes. And we addressed this a little bit already, but if you've got anything to add on your perspective of what you see as being the biggest gap in the tooling or technology that's available for data management today, I'm happy to hear it.
Charlie Crocker
0:43:08
I'll repeat what Sean said around visibility into the overall pipelines. There's still not the perfect tool for understanding what data is the right data to use, where it came from, whether you can trust it, and being able to get a view into that overall picture. And I'm a visual guy, so I'm not talking about code and databases; I'm talking about a way to really see this, so that you can have an intuitive understanding of what's working and not working.
Sean Knapp
0:43:44
Yeah, and I'd say, clearly at Ascend we're really big fans of highly automated systems that track tons of metadata; it's what we do for data orchestration and autonomous pipelines. We also have a pretty UI, Charlie, so you know you like our UI. But as data engineers, gosh, we've been throwing our weight and intellectual horsepower at solving so many other problems for so long. We have self-driving cars, we have incredible machine learning algorithms, we can store more data and process more data and move more data faster than ever before. And so my big takeaway, in a sense the big take here, is that we can now apply that same intellectual horsepower to how we make it easier, faster, and more automated to build data pipelines and maintain and manage them at scale, and in doing so solve a lot of these other problems. How do we get highly intelligent data orchestration technologies out there today? I think that's the next big frontier for data engineering.
Tobias Macey
0:44:50
All right, well, thank you both for taking the time today to join me and explore this space of shadow IT in data and analytics. It's definitely something that I'm sure a number of people have either engaged in or had to deal with at some level, so it's an interesting topic and one that's valuable to address. So thank you both for your time and efforts on that, and I hope you enjoy the rest of your day.
Charlie Crocker
0:45:14
Alright, thanks, Tobias.
Sean Knapp
0:45:16
Thanks so much for having us.
Tobias Macey
0:45:23
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Liked it? Take a second to support the Data Engineering Podcast on Patreon!