Organizing And Empowering Data Engineers At Citadel - Episode 109

Summary

The financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael Watson and Robert Krzyzanowski share their experiences managing and leading the data engineering teams that power the business. They shared helpful insights into some of the challenges associated with working in a regulated industry, organizing teams to deliver value rapidly and reliably, and how they approach career development for data engineers. This was a great conversation for an inside look at how to build and maintain a data-driven culture.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at linode.com/dataengineeringpodcast or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Michael Watson and Robert Krzyzanowski about the technical and organizational challenges that they and their teams are working on at Citadel

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the size and structure of the data engineering teams at Citadel?
    • How have the scope and nature of responsibilities for data engineers evolved over the past few years at Citadel as more and better tools and platforms have been made available in the space and machine learning techniques have grown more sophisticated?
  • Can you describe the types of data that you are working with at Citadel?
    • What is the process for identifying, evaluating, and ingesting new sources of data?
  • What are some of the common core aspects of your data infrastructure?
    • What are some of the ways that it differs across teams or projects?
  • How involved are data engineers in the overall product design and delivery lifecycle?
  • For someone who joins your team as a data engineer, what are some of the options available to them for a career path?
  • What are some of the challenges that you are currently facing in managing the data lifecycle for projects at Citadel?
  • What are some tools or practices that you are excited to try out?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data and everything else you need to know about modern data management. For even more opportunities to meet, listen and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Michael Watson and Robert Krzyzanowski about the technical and organizational challenges that they're facing at Citadel. So Michael, can you start by introducing yourself?
Michael Watson
0:01:39
Yeah, no problem. So I'm Michael Watson, I'm the director of the data engineering organization here at Citadel and head of the enterprise data team. I've been here for about five years and I'm a longtime listener of the show, so really excited to be here.
Tobias Macey
0:01:54
Rob, how about you?
Robert Krzyzanowski
0:01:55
Yeah, I'm Rob K. I work directly with Michael; he's actually my boss. I lead one of the data engineering teams here at Citadel, and we work closely on both the tactical and strategic elements of the team. And so I drive a lot of the initial data initiatives.
Tobias Macey
0:02:09
And Michael, do you remember how you first got involved in the area of data management?
Michael Watson
0:02:12
Sure. So looking back over the last 10 years since I graduated undergrad, I feel like every step along the way has involved dealing with data and the research process one way or another. I learned Python as an undergrad in my first introductory computer science courses and have just been full steam ahead ever since, along the way working at research companies, first in market research and now within the context of a hedge fund. Data is the lifeblood of how we make our decisions, and now how we make investment decisions. Managing the lifecycle of that information, from getting raw data from a vendor or raw data from a web scrape, transforming that into a piece of information that can be consumed by an investment team, and then having that turn into an actual investment decision, is pretty much exactly what we do within the hedge fund. And so getting involved in the data management of that is almost inherent to how we run our business.
Tobias Macey
0:03:15
And Rob, do you remember how you first got involved with data management?
Robert Krzyzanowski
0:03:18
Yeah. So going back all the way through my academic route: I initially pursued a more academic career, starting in pure mathematics, before switching over to industry. I had actually done both mathematics and computer science back in undergrad, so it was a very natural transition. Initially, I focused heavily on full stack web development before transitioning more to machine learning infrastructure and building machine learning models before joining Citadel. And then upon my transition here, I took over a lot of the more tactical initiatives in the data engineering space that we're working on. In terms of how that intersects with my background, I've found it's been great to be able to interface with both the analytical aspects and the engineering aspects.
Tobias Macey
0:04:03
As you mentioned, Michael, one of the main aspects of working in a hedge fund is the fact that you have all of these different data sources that you need to be able to incorporate to ensure that you are making pertinent decisions given the portfolio that you're dealing with. So before we get too much into the specifics of your data engineering practice at Citadel, can you give a bit more background about the role that data plays in the overall business of Citadel and hedge funds in general?
Michael Watson
0:04:32
Yeah, totally. So to understand Citadel, it's good to know that at the top level there are two different sides of the business. One is Citadel Securities, which is almost a quantitative trading firm that acts as a market maker, working in a lot of very interesting areas, but that's actually a separate organization from the hedge fund itself. So if you Google Citadel, you might see Citadel Securities, but pretty much everything I'm talking about today is about Citadel, the hedge fund. In that context, it's something that's called a multi-strategy hedge fund, and that has a big impact on how the organization is laid out. Each one of those strategies invests in a specific asset class and has engineers and technologists aligned to traders, portfolio managers and quantitative researchers working on their bespoke use cases. So the different strategies are a fixed income business, a credit business, a commodities business that's dealing with energy products as well as agricultural products, a quantitative strategies business, and then a long/short equities business, which is where I've spent most of my career. That's teams of portfolio managers that are constructing their own portfolios out of a bunch of different stocks in one sector. Let's say one sector could be technology stocks, where they're constructing a portfolio of, say, Apple, Google, IBM, Microsoft. They have to construct it within a specific risk model, making sure they have a balance of long positions and short positions, so whether the market goes up or down, it stays relatively neutral. And then we have analysts that are reading the 10-Ks and 10-Qs, understanding the fundamentals of the companies that they're investing in very intimately. And then we pair them with incredibly strong data engineers, with strong computer science fundamentals, who are also really strong communicators, who can understand why we might be looking at a specific data set in the context of a company that we might be investing in, see around the corners for things that could change in the data, and build safeguards into different pieces of an ETL pipeline that will protect us from changes in the data source. And we have five different teams, for example, within the equities business, where we will have engineers sitting alongside investment teams working through their different data use cases, alongside a role called a sector data analyst, to ingest data from the outside world, turn it into insights, and then hopefully convert that into a money making investment opportunity.
Tobias Macey
0:07:20
And so you mentioned that there are these different groupings of engineers for the different types of business verticals that you're working with. I'm wondering if you can talk a bit more about the way that your data teams are structured and how the data engineers in particular fit into the overall life cycle of the decision making and incorporation of the data that is used to drive these different business decisions.
Michael Watson
0:07:44
Definitely. So right now we have five different data engineering teams in kind of a hub and spoke model. At the center there's Rob K's team that works on core data engineering, where they're dealing with a lot of the infrastructure that we might be using, like our R infrastructure and our JupyterHub deployments, as well as managing our event-driven processes and our DQM (data quality management) systems, and creating tools for the other data engineers that are sitting on the trading floors with the investment teams. So that role has a little bit more of a software engineering tilt, but it's very business focused and very business aligned with the investment teams. And then the three other data engineering teams that are working with the investment teams in the equities business are a little bit more commercial. You can almost think of them as data engineering consultants, where they will work on two to three projects at a time that could last anywhere from two days to two months, each corresponding to an investment idea. So let's say somebody is trading some company that's buying and selling oranges in Georgia, and the weather patterns of Georgia might have a really big influence on the number of oranges that they might sell in a given quarter. The engineer might first find out where all the distribution centers and retail locations of that orange distributor are, then align that with historical weather patterns in those locations, and create some sort of signal, like the number of days that it was sunny over a Friday, Saturday and Sunday in a given quarter, and see how well that might line up with historical sales of the orange distributor. Again, this is just a hypothetical scenario, there's not actually an orange distributor we're modeling, but you can imagine how you can extrapolate this to other sectors of the economy. Those engineers understand, to an extent, the fundamentals of the company, but they're really focused on best practices when building out that pipeline for data: how it gets from the outside world, whether it's a web scraper or an outside vendor, normalizing and ingesting it internally, and then creating the interfaces that investment teams would then use to consume it, whether in a Jupyter notebook or Excel, or in R, Python or C++ libraries. And then we have a fifth team, which is the enterprise data engineering team. They're working a little bit more with your traditional types of data, so your market data, your pricing data, data about the different types of securities you might be investing in that we're getting from a myriad of different vendors, and making sure that flows into all of our internal systems that we refer to when we actually go to make a trade.
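To make that hypothetical weather signal concrete, here is a minimal pandas sketch of the kind of analysis described above. All of the file names, columns, and data are invented for the example; this is not Citadel's actual code.

```python
import pandas as pd

# Daily weather observations near each store: date, location_id, was_sunny (bool).
# The file and schema are hypothetical.
weather = pd.read_csv("weather_by_location.csv", parse_dates=["date"])

# Keep only Fridays (4), Saturdays (5), and Sundays (6)
weekend = weather[weather["date"].dt.dayofweek.isin([4, 5, 6])]

# Count sunny weekend days per calendar quarter across all locations
sunny_weekend = weekend[weekend["was_sunny"]]
signal = (
    sunny_weekend.groupby(sunny_weekend["date"].dt.to_period("Q"))
    .size()
    .rename("sunny_weekend_days")
)

# Quarterly sales for the (hypothetical) orange distributor: quarter, revenue
sales = pd.read_csv("quarterly_sales.csv")
sales.index = pd.PeriodIndex(sales["quarter"], freq="Q")

# How well does the weather signal line up with historical sales?
print(signal.corr(sales["revenue"]))
```

In practice the interesting engineering is upstream of this snippet: scraping or licensing the weather feed, geocoding the retail locations, and keeping the pipeline robust as sources change.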
Tobias Macey
0:10:27
And how has the overall nature of the responsibilities and the work that the different data engineering teams are doing evolved over the past few years at Citadel, as the tooling and capabilities have improved for being able to manage this data, as more sophisticated analysis techniques have become more mainstream in terms of machine learning and deep learning, and as the requirements around data volumes and data quality have evolved and increased as a result?
Unknown
0:11:01
I would say the one thing that took us a while to learn is you need an engineer, especially in the data space, sitting directly next to the end user of that data set. Looking back maybe three or four years ago, we would have an investment team saying something like, I need to get location data about oil tankers. They might then send that over the fence to some engineer that's maybe sitting on a different floor, or even in a different office, who then has to guess: why might we be using this data for modeling our oil tanker movements? They then transform it, throw it into a table, and throw that back to the investment team on the other side of the fence, who might say, this data structure in no way enables me to do the type of time series analysis that I would want to do. So over the course of the last two years, we've brought the investment teams and the engineers much closer together, where they're sitting on the same floor, side by side. You get a much stronger back and forth in terms of dialogue and idea generation when you have an engineer sitting directly next to an investment professional or a trader, with that free flow of ideas. The engineer can see what's coming next, they understand how the data is being used, and some of the ideas are going to start to come from the engineer as opposed to just from the end user, whether that's an analyst or somebody on the investment side.
Tobias Macey
0:12:33
Yeah, I imagine that that has impacted the overall hiring strategies as well, because of this strong correlation between the quality and capabilities of the teams as a unit, with the engineer embedded with the traders, versus some of the trends in engineering organizations where they're trying to push more for the ability to have remote engineers because of the communications technology that we have now. Because you have different business offices, probably across the world, that are likely focusing on different companies or different business verticals, you then need to have engineers who are able to work closely with them on the particular data types that they need, whereas in a different office it might be a completely different set of projects that they're involved with. So I'm curious how that has manifested in terms of how you focus your hiring strategies and the types of skill sets that you need to have within an office to be able to ensure that you have a well rounded capability.
Unknown
0:13:39
So one of the things that I think makes a technologist really successful at Citadel is that they have an innate interest in understanding financial markets, and they want to know how they can leverage their engineering skill sets to be able to understand something about the world that maybe no one else has figured out yet, and get the validation of having that turn into a successful trading idea. The engineers that have that innate driver, where that actually resonates with them, are the ones that are going to be most successful. And when you put somebody like that directly next to a trader and give those two an opportunity for new idea generation and new approaches to problems, maybe problems that have already been attempted and failed but where a new set of eyes can give a new perspective, it has been incredibly successful for us. If you were to look back maybe two or three years ago, when we first tried to start building out the data engineering org within Citadel, it was much more of a remote engineer model, where requirements were thrown over the fence and the engineer would try to understand how that would translate into an ETL pipeline or a type of analysis that they guessed an investment team might ultimately want to do. But because we never had the opportunity for them to sit side by side, the engineer was constantly guessing. By bringing them in house and sitting them directly next to each other, we've seen an incredible amount of growth in the value that we're producing from the data engineering team, and just new ideas coming out left and right. And so the engineers that have that innate interest in finance and understanding financial markets, but also really strong underlying software development skills, are the ones that we have found have moved the needle more than anyone else. I think that if somebody is interested in technology for technology's sake, there are plenty of roles and opportunities within Citadel or many other firms to be successful, but specifically within the data engineering space, we try to sit as close to the business as we can to understand how we're actually going to use our data in the investment process. Having engineers with that innate drive for understanding financial markets, and the commerciality to sit with a sometimes non-technical business user, has been the single biggest evolution that we've gone through over the last couple years to end up in the organizational structure that we have today.
Tobias Macey
0:16:28
And then in terms of the types of data that you're dealing with, you mentioned that you might be pulling from things like weather information over a certain period of time as it pertains to a business that you're looking at investing in, and then you might also be dealing with market data. So I'm curious if you can talk through some of the categories of data that you're dealing with, and some of the process that goes into identifying which sources are valuable, evaluating them for quality and potential bias, and then ultimately incorporating them into the overall flow of data that you're using to drive these different decisions.
Robert Krzyzanowski
0:17:06
So I'll take that one.
0:17:09
I think one of the things that differentiates us in terms of the challenges that we face around our different categories of data sets, how we evaluate them, and how we evaluate the value of a data product overall, is that we're operating in a space where we're looking at every sector of the economy. So that means sectors like energy, industrials, healthcare, financials, consumer facing services and products, technology, media and telecommunications. In order to effectively operate across all those sectors, what we'll do is condense down to the critical thing that we're trying to predict, such as a top level line item for a particular public company, and then we'll take different permutations of a data set that may be applicable to that company. So we might take a mean, a max, a min, kind of different permutations of the data set, and then run a correlation or a univariate regression to determine what is the best predictor for that particular company. At the end of the day, this really ends up running into challenges where you're working maybe with a 200 terabyte data set, or you're working with a data set where there's a really high uptime guarantee. And what that translates to is a kind of systematic framework that we've constructed to be able to do that in a structured way, but also apply it broadly across these different categories.
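The evaluation loop Rob describes might look something like the sketch below: aggregate a raw data set several ways per quarter, then rank each variant by how well a univariate regression predicts the target line item. The file names, columns, and aggregations are all hypothetical stand-ins.

```python
import pandas as pd
from scipy.stats import linregress

# Raw vendor data: date, value. Target: the company line item we want to predict.
raw = pd.read_csv("vendor_dataset.csv", parse_dates=["date"])
target = pd.read_csv("company_revenue.csv", index_col="quarter")["revenue"]
target.index = pd.PeriodIndex(target.index, freq="Q")

# Different "permutations" of the data set, aggregated to quarters
quarters = raw["date"].dt.to_period("Q")
candidates = {
    "mean": raw.groupby(quarters)["value"].mean(),
    "max": raw.groupby(quarters)["value"].max(),
    "min": raw.groupby(quarters)["value"].min(),
    "sum": raw.groupby(quarters)["value"].sum(),
}

# Univariate regression of the target on each candidate, scored by R^2
scores = {}
for name, feature in candidates.items():
    joined = pd.concat([feature, target], axis=1, join="inner").dropna()
    fit = linregress(joined.iloc[:, 0], joined.iloc[:, 1])
    scores[name] = fit.rvalue ** 2

best = max(scores, key=scores.get)
print(f"best permutation: {best} (R^2 = {scores[best]:.3f})")
```

At the scale mentioned in the conversation (hundreds of terabytes, thousands of data sets) the same idea would be run inside a systematic framework rather than an ad hoc script, but the shape of the computation is the same.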
Tobias Macey
0:18:39
And one of the things that is always a challenge, particularly when you're dealing with a lot of different types of data, is just understanding what data you have, particularly when you have multiple different teams that might be able to take advantage of a common data set, or one team that has a bespoke data set that they're dealing with. So I'm curious what you have in terms of common infrastructure for handling data cataloging and annotation, for being able to understand what the purpose of a data set is and capture that context, and then some of the other processing infrastructure that you have available to these different data teams, and when it's necessary for them to spin up their own custom infrastructure for handling a special case that they're dealing with.
Unknown
0:19:27
Definitely, yes. The idea of a data catalog is near and dear to me, because I built one of the first data catalogs we used within Citadel, going back to 2016. One of the challenges that we have within Citadel from a data engineering and data management perspective is the importance of secrecy and the importance of privacy. If one team is looking at a given data set, they don't necessarily want anyone else in the organization to know what that data set is, or even that they're looking at it at all, because that then gives up some information. And so one of the biggest challenges that we have is actually managing those permissions across the organization, so that, one, you don't have to re-engineer the wheel every time, but two, you can respect the privacy and the permissions of somebody that was the first mover on a given data set. That's always been a challenge for us, but we do have a lot of our data sets cataloged in an internal system, with specific permissions around them that tie into internal permissioning, and somebody can search and then discover some of those data sets. We also have a very large data management team that is constantly going out into the world, cataloging and discovering new types of data, which they then share with internal stakeholders and internal teams so they can know what new data sets are coming out in the market; their job is also to disseminate that information internally. In terms of our shared tooling and infrastructure, we're heavy users of Airflow. Pretty much every one of our data sets corresponds to a given Airflow DAG in a monorepo, so that when a user wants to work on a new project, we have a library called kickstart that will create a new directory within that monorepo and might create a new schema associated with that data set. It'll create the raw templates of either a web scrape or an ETL system, and that kicks things off. It also connects to Alembic and runs some of the DDL statements for creating the necessary schema. Then, as they develop that pipeline, there's a dev branch that corresponds to our dev Airflow and our dev databases; once that gets merged into master, it gets promoted to the prod Airflow server with the prod jobs, and then updates the prod tables from the Alembic migration and goes from there. Having that mapping of DAG to data set to schema, and then to an entry in our internal data catalog, has been a really powerful unifying factor for dealing with the thousands of different data sets that we deal with. But it's still something that we continue to work on and improve: what are the corresponding DQM checks associated with each one of those new data sets? What are the different data access layers corresponding to each one of those data sets? Because we do have a pretty robust data access layer that sits on top of all of this. Once the data gets loaded and normalized into those final SQL tables (everything we have ends up in SQL), what are then the APIs, the queries that sit on top of that, that we templatize to allow somebody to go into Excel or Python or Jupyter or Tableau and extract the exact same view of that data in all the downstream systems where we do our analysis?
And so getting our ETL framework and the SQL schema tables working side by side with a cataloging framework and a data access framework that all point back to the common concept of a data set has been really helpful. I think there's absolutely more that we can do there to unify all of those, but it's something that we've been pretty successful with so far, and we just continue to push on what new integration points we need to help get our time to analysis as short as possible.
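As a rough sketch of the "one Airflow DAG per data set" convention described above, a templated pipeline might look like the following. The DAG id, task names, and placeholder functions are hypothetical (in their setup, the kickstart tool would template this out, and Alembic would own the schema DDL); this uses the Airflow 2-style Python API.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Pull raw data from the vendor feed or web scrape (placeholder)
    ...

def normalize():
    # Transform raw records into the normalized schema (placeholder)
    ...

def load():
    # Write into the final SQL tables (placeholder)
    ...

def run_dqm_checks():
    # Data quality checks gating downstream consumers (placeholder)
    ...


with DAG(
    dag_id="orange_distributor_weather",  # one DAG per data set
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_normalize = PythonOperator(task_id="normalize", python_callable=normalize)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_dqm = PythonOperator(task_id="dqm_checks", python_callable=run_dqm_checks)

    t_extract >> t_normalize >> t_load >> t_dqm
```

The appeal of the convention is that the DAG id doubles as the data set's identity, which the catalog entry, the schema, and the DQM checks can all key off of.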
Tobias Macey
0:24:03
And then another challenge in this overall space, particularly because you have so many different teams and such a broad scope in terms of the number of different offices and business areas that you're dealing with, is the overall aspect of managing the growth and maturity of the team, both in terms of hiring, which we discussed earlier, and in terms of ensuring that engineers stay happy because they have some prospect for growth, whether that's in the projects that they're involved with, the responsibilities that they have, or some sort of promotional ladder that they have the option of climbing. So I'm curious how you handle career development and overall team management and cohesion, given the number of different teams and offices that you're dealing with and the size and scope of the business that you're working in.
Unknown
0:24:56
That's actually something that we've spent a lot of time thinking about over the last year, because we have an org right now where we have data engineers in London, Chicago, San Francisco and Hong Kong that are working on a myriad of different types of problems, and aligning their skill sets as they progress through their careers is really important for us. One thing that we're starting with this year is the creation of an entry level data engineer role that's going to work on our enterprise data engineering team, where they'll work underneath a really strong software engineering manager as the team lead and really focus on core software development skills within the ETL systems for our enterprise data. That's the data feeding into our different reference systems about all the different investable instruments that we have within the firm, information about pricing data, a lot of your traditional market data. As they develop their software development skills within the data engineering environment, we want to give them exposure to our business data engineers that are working directly with investment teams, so that they can also develop their understanding of how we're using data in an investment environment and for a specific investment thesis. So over the course of one to two years, they not only get to a point where they have really strong development skills and understand all of the tooling and infrastructure around our ETL systems, they've also started to understand what that all means in the context of an investment strategy. Then, if they want to, after around two years they can go in a direction where they're starting to work directly with an investment team on a trading floor. Alternatively, we have the core data engineering team that has a little bit more of a software engineering tilt and is responsible for a lot of the core infrastructure and tooling around data engineering that the other business data engineers are using: things like our Airflow infrastructure, our JupyterHub environments, a lot of our event driven and ETL systems that run on Kafka, our DQM systems, and our data evaluation frameworks. If they want to go a little bit deeper into a software development career path, they could go on to that team as well. Once you're already established mid-career as a data engineer, we do have additional trajectories to go deeper on the individual contributor route, where you go from a data engineer into more of a data architect type of role. There we have really strong data engineers that do more architecture and design, so that when there is a complex problem, like a difficult Spark ETL system that requires a lot of tweaking of the JVM, or a team that has a really high throughput of data going through Kafka, they'll be the go-to resource that the business data engineers can lean on, and they're seen as the wise data engineering expert to go to for some of these more bespoke problems. On the flip side, we also have plenty of opportunities to grow as a data engineering manager, just given the pace that we're growing within the org; there are constantly new teams developing. And I think one thing that Citadel does really well is prepare technologists early on in their career for leadership opportunities.
We have a really good internship program where interns are constantly coming in throughout the year, and we pair these really strong college freshmen, sophomores, juniors and grad students with some of our best early career engineers, giving the engineer within Citadel an opportunity to start leading: to realize what it's like when you mentor somebody more junior, to be there to answer questions, coach, teach, and just be a nice person and help them along. The same goes for our rotational program. When somebody joins Citadel through that program, they work on a different team every four months throughout the course of the year, and we pair them with really strong engineers, which gives those engineers more management experience as well and starts a career path towards a management role. So in terms of taking everyone from really strong mid-career architects coming into the data engineering space to new grads that need a lot more coaching, we try to focus on creating opportunities for everyone across that spectrum to continue to grow. If we can do that in the context of the most important thing, which is finding the best returns that we can in each one of the markets that we invest in, then we can create not only an incredibly successful hedge fund, which is always going to be the first and foremost goal of why we're here, but we can also allow people to grow as professionals and as technologists. Finding that sweet spot where both of those are directly aligned is, I guess, part of my job in helping lead this team. And I think we're doing a good job there, and we need to continue to reevaluate how we can do even better, but it's something that I'm particularly proud that we really try to focus on at Citadel data engineering.
Tobias Macey
0:30:50
Another challenge, particularly because you have all these different data sets and, as you noted, you sometimes don't want to alert different teams to the data sets you're working with, so it's not always easy to keep a global view of what data sets are available, is how you manage the lifecycle of the data: from identifying it and incorporating it into your business decisions, to storing it and actually using it for analysis, and then ultimately deciding when to either retire it or keep it updated. So I'm curious how you approach that overall aspect, given the strict regulatory environment that you're dealing with.
Unknown
0:31:30
One of the things I want to point out is that data engineering is only one part of our overall data strategy at Citadel. We work and partner with two incredible groups of people. The sector data analysts, who work closely with the investment teams on their data strategy and how they want to extract information from these data sets and incorporate it into their investment strategy, are critical in helping to evaluate which data sets we want to go forward with. And also the data strategies group, who are world class data scientists that are also helping us extract a lot of this information and figure out which of these data sets have legs. So a lot of times the data engineers will get involved once we decide to go forward with a given data product. And then it's no longer a question of what data we want to ingest; it eventually becomes a question of what data we want to turn off and no longer support, because there is overhead in continuing to maintain a given data product that has a corresponding pipeline and a set of DQM checks that will occasionally fail. A lot of the data that we're consuming is constantly changing and evolving. If it's coming from a web scrape, there could be a total refactor of the site where they're using totally different CSS tags, or a totally different product, where what was originally a static HTML page could have changed to an Angular or a React application, and getting the information content from that could change dramatically. When that happens, it's going to set off some alerts, and one of our data engineers might start taking a look at it. And that takes time, right? So we want to make sure that we're only reacting to data quality issues or failures of ETL systems for data that's actually making an impact. And that's where our usage tracking systems come in really handy. Every time that somebody looks at a given data point from Excel or Tableau or R or Python or C++, we log that observation, so we end up getting tens, maybe hundreds, of millions of different log entries a day that all tie back to a given data asset. Then at the end of the quarter or the end of the year we'll say: we're now supporting 2,000 different data assets; how many of those has no one looked at over the course of the last six months? And of the stuff that people aren't looking at, let's turn it off. Let's stop supporting that system, so that when we know it's going to fail eventually, we don't have to have somebody spend their time trying to fix it. And so we realized that in order to make this a sustainable process where we can continue to grow, knowing what to turn off is oftentimes more valuable than knowing what to turn on and go forward with, and that's something we've really focused on over the last couple of years.
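A back-of-the-envelope version of that usage-based pruning could look like the sketch below. The log file, column names, and six-month cutoff are illustrative assumptions, not details from the conversation beyond what Michael describes.

```python
import pandas as pd

# Access log rows: asset_id, accessed_at (one row per read from Excel,
# Tableau, R, Python, C++, etc.). File name and schema are hypothetical.
access_log = pd.read_parquet("data_access_log.parquet")

# Most recent access per data asset
last_access = access_log.groupby("asset_id")["accessed_at"].max()

# Assets untouched for six months become candidates to turn off.
# Note: assets with zero log entries would also need to be joined in
# from the catalog, since they never appear in the log at all.
cutoff = pd.Timestamp.now() - pd.Timedelta(days=180)
stale_assets = last_access[last_access < cutoff].index

print(f"{len(stale_assets)} of {len(last_access)} assets unused for 6 months")
```

At hundreds of millions of log entries a day this aggregation would live in the warehouse rather than in pandas, but the decision rule is the same.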
Tobias Macey
0:34:54
And as you continue to evolve the capabilities and requirements of the data organization at Citadel, what are some of the challenges, whether technical, business oriented or team oriented, that you are facing and that you're interested in tackling in the coming weeks and months?
Unknown
0:35:16
One of the challenges that you get when you're an investment organization that invests in every industry imaginable, every sector, and every type of instrument, and that's been around for 20 years, is that you have such a large sprawl of data that you've accumulated, and understanding the quality of that data across all of the different touch points is incredibly difficult. That's something that we're looking at tackling: unifying our DQM frameworks across the different data sources that we are ingesting from, and being able to either, one, sleep at night knowing that everything is as it should be, or two, at least be woken up by a data set that you know is important, that you know is wrong, and where it's a system identifying it and not a person. If you can do this and not have any unknowns in terms of your data (you know that there are a lot of things wrong with it, but you don't have any unknowns), it at least gives you a really good foundation for addressing the different ETL systems that maybe were written 10 years ago, that no one's looked at, but that some production process might be relying on. So really tackling the DQM process and problem is, for me, one of my biggest goals in 2020.
Tobias Macey
0:36:57
And are there any tools or practices or industry trends that you're keeping an eye on that you're excited to try and incorporate into your workflow?
Unknown
0:37:07
Yeah, totally. We built one library called longhair, and there are a lot of similarities to Great Expectations; we were very early on, when Great Expectations was just getting going, and I think that project's come a really long way. I think that one of the downsides with Great Expectations is that it requires somebody to go in and write code for certain types of data unit tests, but for a lot of our engineers, it's a really strong framework. So that's something that I personally think is a great project I'd like to look at a little bit deeper. And then there are also industrial scale DQM libraries, some specific to finance, some not, and we'll definitely be investigating more in that space. We also have a lot of really good internal tooling frameworks for how you can write DQM tests, but they're not necessarily deployed across the entire stack. So finding more unification across the different ETL systems that we have, using common tooling (even if it's not the best, using the same framework everywhere) would be a huge one.
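For readers unfamiliar with the kind of "data unit test" being discussed, here is a minimal sketch using Great Expectations' earlier pandas-style API (the API has since evolved, and the file and column names here are hypothetical):

```python
import great_expectations as ge

# Wraps the DataFrame in a class that adds expectation methods
df = ge.read_csv("daily_prices.csv")

# Declarative expectations about the data, analogous to unit tests on code
df.expect_column_values_to_not_be_null("close")
df.expect_column_values_to_be_between("close", min_value=0)
df.expect_column_values_to_be_unique("trade_date")

# Run all registered expectations and report pass/fail
results = df.validate()
print(results.success)
```

The appeal for a DQM framework is that the checks live alongside the pipeline and fail loudly when a source drifts, rather than relying on a person to notice.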
Tobias Macey
0:38:10
And are there any other aspects of your work at Citadel or the challenges that you're facing or the ways that you're using data that we didn't discuss yet that you'd like to cover before we close out the show?
Unknown
0:38:20
Yeah, I think one of the coolest things that I've personally worked on over the last couple years is how we have integrated Jupyter notebooks with our deployment process for analytics. We have an internal JupyterHub deployment that runs on HashiCorp Nomad, similar to Kubernetes, but something that we adopted relatively early on, and we're relatively mature with it at this point. On top of that, we've created a lot of custom Jupyter plugins where an analyst can come in and click a button. It'll then extract the code from Jupyter and store it in an internal Elasticsearch database. Then, whenever a user references one of those specific functions that were originally in the notebook, in either Excel or Python or R or C++, a process that runs within HashiCorp Nomad will load it into memory as a Python module and then execute it. Those functions all return pandas DataFrames, and those are returned back to the clients that originally requested them via a standardized API. What that allows for is an analyst who doesn't necessarily know how to deploy a model, or how to deploy a specific data wrangling exercise, to make it seamlessly accessible to a portfolio manager who maybe only knows Excel. That portfolio manager only knows how to interact with Excel, has no idea how to do anything in Jupyter and Python, and maybe couldn't even tell you what CSV stands for. But if you can give them that information from that analyst, they'll be able to leverage it in a way that maybe nobody else in the world can, because they might be an expert in the underlying economics of the company or market that they're investing in. Creating an infrastructure that pairs the power of analytics that you can get from a Jupyter notebook, Python, or even RStudio with some of the existing enterprise processes around research, whether in Excel or other frameworks, has been incredibly powerful for us. So we continue to push on how we can allow Jupyter to integrate into the research process, and we're continuing to look for new ways to do that going into 2020.
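The core of that notebook-to-API flow, stripped of the Elasticsearch and Nomad plumbing, might look something like this sketch. Every name here is hypothetical; it only illustrates loading stored notebook source as a module and calling a function that returns a pandas DataFrame.

```python
import types

def load_notebook_function(source_code: str, func_name: str):
    """Compile stored notebook source into a fresh module and return the named function."""
    module = types.ModuleType("notebook_analytics")  # hypothetical module name
    exec(compile(source_code, "<notebook>", "exec"), module.__dict__)
    return getattr(module, func_name)

# Source as it might have been captured from an analyst's notebook cell
# (in their setup, this string would be fetched from Elasticsearch)
stored_source = """
import pandas as pd

def orange_signal(quarter: str) -> pd.DataFrame:
    return pd.DataFrame({"quarter": [quarter], "sunny_weekend_days": [11]})
"""

func = load_notebook_function(stored_source, "orange_signal")
print(func("2019Q4"))  # the standardized API would serialize this DataFrame for Excel, R, C++, ...
```

Because every deployed function returns a DataFrame, a single serialization layer can serve all of the client environments without the analyst ever thinking about deployment.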
Tobias Macey
0:40:53
Well, for anybody who wants to get in touch with either of you or follow along with the work that you're doing, I'll have you add your preferred contact info to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. And I'll start with you, Michael.
Michael Watson
0:41:10
Sure. So I would say a unifying system that can connect the underlying tables, which may exist in some ANSI SQL format, to the concept of a data set; link that to the concept of a set of DQM checks; link that to the underlying ETL systems that are processing it; and then ultimately link that to a set of downstream interfaces that are accessible to users, whether it be in Excel, Tableau, Looker, or any of your common downstream formats, and have all of that tied together in one concept of a data set. And have that one concept of a data set be permissionable using Active Directory, so that you could then deploy it within the enterprise and permission people to access different parts of that individual data set throughout its entire lineage. That's something that doesn't exist today, but it would be a great benefit if it did.
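The unified "data set" record Michael is wishing for could be sketched as a simple metadata model like the one below. This is purely illustrative; the fields and example values are invented, not a description of any existing system.

```python
from dataclasses import dataclass, field

@dataclass
class DataSet:
    """One record tying together everything a data set touches across its lineage."""
    name: str
    tables: list[str] = field(default_factory=list)      # underlying SQL tables
    dqm_checks: list[str] = field(default_factory=list)  # associated quality checks
    etl_jobs: list[str] = field(default_factory=list)    # pipelines producing it
    interfaces: list[str] = field(default_factory=list)  # Excel, Tableau, Looker, ...
    ad_groups: list[str] = field(default_factory=list)   # Active Directory permissions

oil_tankers = DataSet(
    name="oil_tanker_locations",
    tables=["raw.tanker_pings", "curated.tanker_daily"],
    dqm_checks=["tanker_pings_not_null", "tanker_daily_freshness"],
    etl_jobs=["dag:oil_tanker_locations"],
    interfaces=["excel", "tableau"],
    ad_groups=["ds-oil-tankers-readers"],
)
```

The hard part, as the conversation makes clear, is not the record itself but having every tool in the stack honor it, especially the permissioning across the whole lineage.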
Tobias Macey
0:42:27
And Rob, how about you? Do you have any particular gaps that you're feeling the pain of that you'd like to share?
Robert Krzyzanowski
0:42:33
Yeah, I think one trajectory as an industry that data engineering has been headed towards is mirroring the revolution on the software engineering side with test driven development: starting with a core set of test cases and specifications that are really written as code, and then iterating on those. You start with a couple tests that may be red, then you develop a sub-sampled prototype ETL pipeline, you pass those tests, and you continue to run them; you start with a small kernel and blossom on top of that. That's opposed to a lot of the standard data engineering processes in the industry as they exist today, where you have to keep all of that in your head and lay out the raw pieces on metal, and there's less of an interactivity and sub-sampling aspect that allows you to iterate as quickly and optimally as possible on the development lifecycle of those data sets and really treat them as end to end products or applications that you're developing. So I think there's a lot of headway to be made, and we've made a lot of headway on that internally, and it's definitely something that I personally am hoping to continue seeing great developments on, both externally in the data engineering community and also internally here at Citadel. That's something that I would love to have a follow up conversation on, for anyone who wants to reach out to us about the problems that Michael mentioned or that I've mentioned here; I would just encourage you to start that dialogue, because we have a lot of ideas for how to solve these problems. We just need to continue pairing the business facing data engineers with those that are interested in this approach of starting with the core tools and increasing the leverage of each data engineer and each business user. From there, I think we'll be in a really good spot in 2020 to 2025, and I really see a great future for data engineering, both here at Citadel and externally.
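The test-first workflow Rob describes, applied to a pipeline, might start with something as small as the following pytest sketch: a red test against a tiny sub-sample, and a transform kernel grown until it passes. The transform and fixture are hypothetical.

```python
import pandas as pd

def normalize_prices(raw: pd.DataFrame) -> pd.DataFrame:
    """Tiny kernel of the pipeline under test: drop null closes, cast to float."""
    out = raw.dropna(subset=["close"]).copy()
    out["close"] = out["close"].astype(float)
    return out

def test_normalize_prices_drops_nulls_and_casts():
    # A sub-sample small enough to reason about by hand
    sample = pd.DataFrame({"close": ["10.5", None, "11.0"]})
    result = normalize_prices(sample)
    assert result["close"].isna().sum() == 0
    assert result["close"].dtype == "float64"
```

Running `pytest` on this file keeps the feedback loop seconds long; the same tests then guard the full-scale pipeline as it blossoms on top of the kernel.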
Tobias Macey
0:44:35
Well, thank you both for taking the time today to join me and share the work that you're doing and some of the challenges and successes that you've had at Citadel. That's definitely an interesting problem space. And it's always great to hear about the ways that people are attacking the work that they've got. So thank you for all of your time and efforts on that front and I hope you enjoy the rest of your day.
Michael Watson
0:44:56
Thanks so much, Tobias. You as well.
Tobias Macey
0:45:03
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.