Summary
In this episode of the Data Engineering Podcast, Effie Baram, a leader in foundational data engineering at Two Sigma, talks about the complexities and innovations in data engineering within the finance sector. She discusses the critical role of data at Two Sigma, balancing data quality with delivery speed, and the socio-technical challenges of building a foundational data platform that supports research and operational needs while maintaining regulatory compliance and data quality. Effie also shares insights into treating data as code, leveraging modern data warehouses, and the evolving role of data engineers in a rapidly changing technological landscape.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
- Your host is Tobias Macey and today I'm interviewing Effie Baram about data engineering in the finance sector
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the role of data in the context of 2Sigma?
- What are some of the key characteristics of the types of data sources that you work with?
- Your role is leading "foundational data engineering" at 2Sigma. Can you unpack that title and how it shapes the ways that you think about what you build?
- How does the concept of "foundational data" influence the ways that the business thinks about the organizational patterns around data?
- Given the regulatory environment around finance, how does that impact the ways that you think about the "what" and "how" of the data that you deliver to data consumers?
- Being the foundational team for data use at 2Sigma, how have you approached the design and architecture of your technical systems?
- How do you think about the boundaries between your responsibilities and the rest of the organization?
- What are the design patterns that you have found most helpful in empowering data consumers to build on top of your work?
- What are some of the elements of sociotechnical friction that have been most challenging to address?
- What are the most interesting, innovative, or unexpected ways that you have seen the ideas around "foundational data" applied in your organization?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with financial data?
- When is a foundational data team the wrong approach?
- What do you have planned for the future of your platform design?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- 2Sigma
- Reliability Engineering
- SLA == Service-Level Agreement
- Airflow
- Parquet File Format
- BigQuery
- Snowflake
- dbt
- Gemini Assist
- MCP == Model Context Protocol
- DTrace
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Poor quality data keeps you from building best-in-class AI solutions. It costs you money and wastes precious engineering hours. There is a better way. Coresignal's multi-source, enriched, cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files. Over 700 companies work with Coresignal to develop AI solutions in investment, sales, recruitment, and other industries. Go to dataengineeringpodcast.com/coresignal and try Coresignal's self-service platform for free today. Your host is Tobias Macey, and today I'm interviewing Effie Baram about data engineering in the finance sector. So, Effie, can you start by introducing yourself?
[00:01:30] Effie Baram:
Yes. Thanks for having me. My name is Effie, and I've been leading foundational data engineering at Two Sigma for the last two years, and I've been in data engineering for the past four years.
[00:01:45] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:48] Effie Baram:
Yes. That was about ten years ago. I was actually overseeing reliability engineering at the time, and one of the roles that my team had was to procure and produce research data from our trading systems. It was a pretty large dataset, and the data ecosystem was a little bit different back then. SLAs, especially for this dataset, were pretty tight. The quality requirements were very high, but they were not defined, so you didn't really realize that you were producing garbage. The datasets mostly failed when we ran them because the systems we used to produce them were cron-like, so it was hit or miss, and many times a miss, often for infra-related reasons.
And the infrastructure logic and the business logic were all intertwined, so you pretty much needed a PhD to operate and troubleshoot a relatively complex dataset. This is also when I first considered shifting to using DAG orchestration systems. At the time, it was Airflow. And right there, that choice alone completely shifted how we were able to manage these datasets. The separation between the business logic and the infrastructure, and having the infrastructure be produced using a DAG orchestration system, was the game changer for me. And that was really interesting. I didn't even consider that data was that complex and so delicate all at the same time.
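To make the separation she describes concrete, here is a minimal, illustrative Airflow 2.x-style DAG in which the business logic lives in plain Python functions and Airflow only owns scheduling, dependencies, and retries. The DAG, task, and function names are hypothetical, not Two Sigma's actual pipelines.

```python
# A minimal sketch of the pattern described above: business logic in plain
# functions, orchestration (scheduling, ordering, retries) left to Airflow.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_trades(**context):
    # Pure business logic: read raw trade records for the run date.
    # In a real pipeline this would pull from the trading system's store.
    run_date = context["ds"]
    print(f"extracting trades for {run_date}")


def build_research_dataset(**context):
    # Transform raw trades into the research dataset; no infra concerns here.
    print("building research dataset")


with DAG(
    dag_id="research_trades_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    extract = PythonOperator(task_id="extract_trades", python_callable=extract_trades)
    build = PythonOperator(task_id="build_research_dataset", python_callable=build_research_dataset)

    # Airflow owns ordering and retries; the functions own the logic.
    extract >> build
```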
[00:03:31] Tobias Macey:
Absolutely. And so bringing us now to where you are at Two Sigma, I'm wondering, before we get too deep into what you're building there, if you can give a bit of an overview about some of the ways that data plays a role in the organization and some of the characteristics of the data that you need to work with there?
[00:03:52] Effie Baram:
Yeah. So at Two Sigma, by and large, we mostly focus on data. Data is at the core of what we do. We either procure it from vendors, you know, exchanges, wholesalers, think Reuters, Bloomberg, but we also produce a lot of data, and that's always been the case. We are mostly focused on research. So where you have a lot of businesses where the focus is on the actual production, the GA, think of it like what's running in production, in our case, we spend most of our time understanding data and deriving meaningful insights from it. And, specifically, in foundational data, think of us as wholesalers of that foundational market data. If you look at different industries, every industry that has something to do with data would have that problem where you have a core dataset that you rely on, and all of your downstream consumers have certain expectations of that data. So for instance, in medical research, you'd probably have information about your patients, and it has to look a certain way, and you procure it, definitely not from vendors, but from different sectors or sections of your hospital or research departments. So what my team does is basically build and maintain the infrastructure to procure the datasets, and we make sure that we deliver the datasets as quickly as needed for the various business needs. Not everything needs to be in the microseconds.
Sometimes it's minutes. Sometimes it's days. And, again, it depends on the frequency. With high-frequency data, you might need specialized hardware to basically procure or receive the data and transform it. Or if it's lower frequency, you would have the opportunity to actually enhance the dataset and make sure that it actually conforms to what your customers need.
[00:06:05] Tobias Macey:
Given the nature of the organization and the ways that the data is interacted with, as you mentioned, it's not necessarily what many listeners might be experienced with where the data that they are responsible for curating is going to be immediately used in some form of production context, whether that's business intelligence or analytics or user facing features, and instead it's more on the research role. I imagine that maybe the latency tolerance is a little bit higher, but the requirement around quality and accuracy is also going to be higher. And I'm wondering how you think about the areas of focus and the points of criticality in the work that you're doing given the context in which you're operating.
[00:06:51] Effie Baram:
Yeah. That's a really challenging problem. And the reason it is challenging is because the more accurate and rich your data is, the longer the journey is to make sure that the insights one is expecting actually get delivered. So, for example, if we consume certain data from a particular vendor and we have certain expectations for how it should look, and we need a very deep, rich history, say, going back thirty, forty years, that basically means that, in order to deliver it to our customers, the journey to even begin that research is a long journey, longer than one would have the appetite for. So we had to really figure out a balance whereby we deliver something more like sample data. Think of it more like raw data with a looser schema, but the SLA is much lower, so that your end user can at least start looking at the data and saying, is it the right shape? Does it have the right attributes that I might be looking for? Do I need thirty, forty years' worth of history, and does it have to be fine-grained in order for me to even begin, or can I start with maybe just a year's sample? And behind the scenes, you have a lot of considerations that, again, in the past, I never even considered.
Legal considerations, cost, obviously, storage. But the final one, which, I think gets more complicated as time goes on, is really maintaining your schema as the richness of the data increases to make sure that your downstream customers are not impacted by it. And this is something where, you know, ten, fifteen years ago was much harder because chances are you were in a database with a very fixed, schema that everyone was expecting. And fast forward to today, you have data warehouses where the data could be a little bit looser, and you have multiple customers querying the data in different ways. So there's definitely more innovation, but you also have to get there. And that's where complexity is added.
[00:09:00] Tobias Macey:
And digging a bit more into that concept of foundational data engineering: obviously, it brings along the connotation that the work that you're doing is required, and the level of reliability that you're responsible for is going to be quite high because everybody else is building on top of what you are creating. And I'm wondering how that shapes the ways that you think about the technology choices, the ways that you structure the work that you're doing, the pace of change that you're willing to accept because of the fact that everybody else is relying on you to be that point of stability.
[00:09:38] Effie Baram:
Yeah. Again, this is proving to be a tremendous challenge and will probably remain that way, because now we all have access to a lot more data over a much longer period of time, at finer granularity, and maybe at lower latency. And so when you add all that together, the ability to both deliver fast and at that level of depth really goes right up against the need for much higher quality. So what I've done is shift what we're delivering to our customers. Think three or four years ago, we would spend a significant amount of time up front to make sure that the data we deliver meets all the production requirements for all our users, and so the journey to get there was simply slow, and the more data we added, the slower it got. This, by the way, is also something that I've observed over the past ten years, where in the past, datasets were a lot more naive, not as complex. Nowadays, the DAGs are extremely complicated. Lineage is extremely important, something that we never really considered before. And so the way we handled this sort of conflicting pattern was to move the data that's needed for research into a looser schema, more into the data warehouse, where the quality and the history are not nearly as rich as what you would expect in production. And what we would do is create certain milestones along the way. What that does is give the researcher the opportunity to, one, augment their data. Sometimes research ideas end up dying on the vine, so you don't necessarily make a full commitment if you don't really know that you're going to go all in, or you even produce a leaner, more naive dataset up front so you can get it into production faster and enrich it over time. In some ways, I think of it almost like data as code, where the nucleus of your idea, think of a proof of concept, gets delivered first, and you build upon it over time. So everybody wins in this mode.
[00:12:07] Tobias Macey:
And with that platform mindset of the fact that you are building these systems for other people to be able to do their own work on top of it, and you're usually working with those end users to figure out what are their needs, what are their capabilities. Because of the fact that you're working with researchers, I imagine that they have at least some relatively high level of technical acumen to be able to bring in their own tools, to find their own workflows, which also can be a complicating factor as somebody responsible for a platform because they want ultimate flexibility, but you want to be able to enforce some level of controls and standards so that it doesn't turn into a mess for everyone else. And I'm curious how that has posed a challenge in terms of how you think about what are the interfaces, what are the capabilities that you want to empower them to have, and what are some of the ways that you want to either encourage some level of build their own platform addendums versus bringing those additional capabilities into the fold of your own control to make them generalized for everyone else?
[00:13:13] Effie Baram:
Yeah. That's absolutely a significant challenge. The reality is that you have to support both. If you want to go fast, you have to be able to operate in a more agile, looser way, and likely a little bit away from your core platform offering. And if you want to go accurate, that's when you bring your innovation back into the platform. And in some ways, I actually think it's a very reasonable model, provided that you really box the number of experiments that you have, and you also give enough buffer to bring the experiment back into the platform. And this is easier said than done. We all know that. We tend to immediately move on to the next experiment. That's definitely a pattern that I've seen, and I understand it. Obviously, the business pressures will always be greater than our ability to deliver. But the way that we balance the two is to create a tiger team behind a particular innovation that we want to foster. And the thing that I would personally do from an engineering perspective, when I'm working with the business, is try to find a technology vehicle, or any ideas that we have as engineers, to run through the business innovation to see if we can also bring those back into the platform. So I'll give you an example. In the past couple of years, we moved to relying on Parquet files, away from other file formats. To do that on a platform basically signs you up for a very lengthy migration process, and, usually, the business will have no appetite for it because, to them, there is absolutely no value in what we see, which is obviously performance, standardization, and extra tooling that is basically turnkey with your other platforms. So for us, it was a no-brainer. And so what we did was introduce the technology while working with the business on a new idea. And when we saw that we were actually able to get what we wanted, we used that as the pattern to map back into the platform using other projects. So these are some of the strategies that I employ so that the projects are not just all engineering driven, because then they'll immediately be shut down for not having commercial value.
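As a rough illustration of the kind of file-format standardization described above, the sketch below converts a legacy CSV extract to partitioned Parquet with pandas and pyarrow. The paths, column names, and partitioning scheme are assumptions for the example, not details from the conversation.

```python
# Convert a legacy CSV extract to columnar, compressed, partitioned Parquet so
# downstream tools (warehouses, dbt models, notebooks) can read it efficiently.
import pandas as pd

# Read a legacy CSV extract (assumed layout: one row per quote).
df = pd.read_csv(
    "legacy_extracts/quotes_2024-06-03.csv",
    parse_dates=["timestamp"],
    dtype={"symbol": "string", "price": "float64", "size": "int64"},
)

# Partition by trade date so readers can prune partitions.
df["trade_date"] = df["timestamp"].dt.date.astype("string")
df.to_parquet(
    "curated/quotes/",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["trade_date"],
    index=False,
)

# Reading back a single partition is cheap and schema-aware.
sample = pd.read_parquet("curated/quotes/", filters=[("trade_date", "=", "2024-06-03")])
print(sample.dtypes)
```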
[00:15:39] Tobias Macey:
Another challenge that I run into periodically, particularly in the context of data work, is that somebody may build a system that works for their particular application. They have their own set of control flow for how to do the data processing, and they end up landing it in the context of an application database. And so then you see, okay, well, this is a data engineering requirement. We can do this much more efficiently and more scalably and in a more generalized pattern that allows that data to be reused across more contexts. But then you have to justify the duplicative work of what they've already done to then allow for that data to be used in more use cases or to be able to standardize on different tool chains. And I'm wondering how you've generally approached the justification of that duplicative work where somebody has something that is functional, but you want to rebuild it in a different way, and then figuring out what is that last mile of the handoff to their operational context to say, okay, I've done all of the work that you were doing, and now here's what the actual interface looks like for you to access the same dataset without you having to completely reengineer your application or the data structures that it's reliant on.
[00:16:53] Effie Baram:
Yeah. It's another very common challenge, and it's not unique to data; it's common across software engineering. I think the value of a line of code to one individual is obviously very, very high because it solves their problem 100% of the time, right? But when you really try to map it back to the platform, you now have to consider the ways in which your particular feature is written. And so, unfortunately, this is a common pattern. In some ways, it's also a good pattern, because you might actually realize that this detour can be used to shift some of the patterns. But in order to do that, what I would recommend, and what I've done, is partnering early with the teams that are working on a particular feature, and either through collaboration, where we contribute some and they contribute some, we close the gap to make sure they don't veer off, or we have contracts at the end of the project whereby we have some time allotted to make sure that we bring the feature back into the platform.
But, again, this is all under the umbrella of delivering something fast for the business. It's very hard to do that while you have this really living, breathing, mature platform that needs to meet everyone else's requirements. The two simply collide. And so being able to fold those experiments back in is the single most important aspect. I think keeping the balance is definitely needed, but the two will coexist.
[00:18:42] Tobias Macey:
So another challenge when you're working at that foundational layer is that you are going to largely be responsible for understanding and implementing any regulatory requirements or controls around the data that you're operating with. And given that you're in the financial sector, I imagine that there are a substantial number of them, and then ensuring that the people who are consuming the data understand the requirements and the reasoning for different security controls or access controls that are in place. And I'm wondering how you think about managing that tension of the regulatory and technical complexity that it brings along with the organizational communication and best practices around how to interact with that controlled dataset.
[00:19:30] Effie Baram:
Yeah. It's a great question. Though I would say, you know, every industry has its own version of constraints, whatever those might be. And in some ways, when you think about software development or problem solving as a whole, I find that operating in a constrained environment breeds more creativity. Because when you're very open ended, there is the opportunity to perhaps think a little bit more simplistically. But when you have guardrails and constraints, you actually have to consider so many additional use cases, again, especially on a living, breathing system. So I personally see regulatory constraints almost as testability of your code. It puts boundaries and an interface on what your data, or the information that you're producing, is expected to contain, and you have those receipts along the way. And so I personally enjoy that, because I find it more challenging and therefore more rewarding.
But, again, what is considered regulatory in our industry would have a different equivalent in other sectors, say, in medicine, right, like HIPAA laws and so on. So you have to consider those just as much as you have these.
[00:20:47] Tobias Macey:
And in terms of the technical considerations around building this data platform, obviously you want to make sure that the data is accessible, that you have some sort of controls, and that you have reliability. I'm wondering how you think about the selection of which tools to use off the shelf, the customizations that you build, and some of the specific in-house technology that you've invested in to be able to facilitate this platform approach to empowering the organization to use data as its core resource.
[00:21:23] Effie Baram:
It's really interesting to be living, you know, in a time where there are a lot of AI capabilities on the right and a lot of turnkey solutions on the left. When you look back ten, fifteen years ago, if you needed to deliver a data platform, or any platform for that matter, that was more sophisticated than, say, just a storage system, let's assume it was doing some pretty complex things for the business, you needed a significant amount of software development investment. Whether you bought it off the shelf or effectively rolled your own, you needed to invest up front significantly to build the platform before you even brought in the actual components, be it the data that's flowing through it or the actual business logic that you were writing. And so fast forward to today, a lot of those capabilities are available to you. Maybe not 100%, but I would say 90% of what we could possibly want to do in software engineering, and certainly in data engineering, is now available.
And so my personal philosophy is that investing in and building nondifferentiated infrastructure is something that you have to consider very carefully before you put forth the software development skills, because, one, it takes away from solving the business problem. But the second part is that it requires a significant and continuous investment over time. You will never be in a position where you call a vendor or simply upgrade your software by getting a new download from your favorite vendor. Here, you actually have to debug the stack and make sure that it really meets your continuous requirements.
So I personally am very much buy versus build. That being said, it's not the solution for everything. I also have plenty of build solutions. For example, in the data quality space, back when we were looking at the vendor ecosystem, given our requirements, we really needed to solve it in a different way. And so we embarked on a journey to write our own, and it really served us well for that particular purpose. In storage and data warehousing, we used to have more proprietary systems. We're now moving to using data warehouse solutions like BigQuery, Snowflake, you name it, because those also come wired with a range of other hooks that you don't have to worry about. So, again, you can put your Parquet files in there, and you can hook it up with dbt, and it comes with a lot of bells and whistles without having to really teach your customers, or, in my case, my researchers or developers, how to use that interface.
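To illustrate the "buy rather than build" ingestion path she mentions, here is a minimal sketch that loads raw Parquet files into a BigQuery table with the official Python client. The project, dataset, table, and bucket names are placeholders, not anything referenced in the episode.

```python
# Load raw Parquet files from object storage into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client(project="my-research-project")

table_id = "my-research-project.raw_market.quotes"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Parquet carries its own schema, so BigQuery infers column names and types.
load_job = client.load_table_from_uri(
    "gs://my-raw-landing-bucket/quotes/trade_date=2024-06-03/*.parquet",
    table_id,
    job_config=job_config,
)
load_job.result()  # Block until the load completes.

table = client.get_table(table_id)
print(f"{table.num_rows} rows now in {table_id}")
```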
[00:24:29] Tobias Macey:
And then as far as the architectural patterns, you mentioned the kind of levels of completeness or levels of curation for the data. You mentioned that you're standardizing around more of these off the shelf warehouse components. And I'm wondering if you can just talk through some of the ways that you think about the architectural substrate and then the design patterns about how you manage the data through the various stages of its life cycle.
[00:24:58] Effie Baram:
Yes. So, again, moving to more common technologies, say, a data warehouse, what that allows us to do today is, one, standardize and normalize all the data ingestion and bring in the data in its raw format. When you look back fifteen, twenty years, you had to have a certain shape to your data as you brought it in. And when you had to make changes to the schemas, you had to do that very carefully, one, but two, chances are you didn't really have lineage in place to know what had changed, when, and by whom. And so proceeding in that mode back then was much more complicated than it is today. So having one single data warehouse where all the data is ingested has accelerated and normalized for us the ability to procure a lot of data from a much wider range of vendor sources without having to worry as much about the things that come much later in the workflow of getting data ready. So that would be one. Then we move on to modeling the data and shaping it. And, again, this is something that, in the past, we had to proceed with very carefully, because anything that you change might have adverse impact downstream to customers.
Here, in a platform like BigQuery, you can have multiple versions and views of the work that you're doing. You can checkpoint it. You can hook it to dbt and actually perform CI and CD. And to me, that's probably one of the most interesting shifts that I see in data, and one of the most exciting, where in the past, it was pretty hard to consider your data as code. If you wrote SQL, good luck testing it. Fast forward to now, you have your pandas, you have dbt, you have capabilities to basically ensure that you model your data or make any transformations or changes to it while having a record. And now we are able to actually treat the data as code. So we talked about ingestion into its raw format. We now have the capability to have multiple users look at the same data and basically derive the relevant meaning for them while we are focusing on modeling it. We can then take the data to the next level and start preparing it for simulation.
For that one, performance does matter, history matters, and the quality of the data matters. So we may not do it in our data warehouse, because it may not meet our requirements, but we have the ability to actually extract it. We create snapshots for it, and, again, these are very much standardized so that all of our customers know what to expect and how to wire their experiments onto our datasets. And we basically provide them an environment that looks and feels like what one would expect from a research environment. We get the feedback back from them, rehydrate the data in our data warehouse, enrich it, and finalize the modeling. And once we have it ready to go, we then promote it into production. And, you know, the production system is probably not nearly as sophisticated, if you will, with all the research capabilities, but the reality is it doesn't need to be, because we're not performing research in production.
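As a small, hypothetical example of the "data as code" idea, checkpointed datasets can be validated by tests that run in CI before a snapshot is promoted. The sketch below uses pandas; the table layout, paths, and rules are invented for illustration rather than taken from the episode.

```python
# Illustrative "data contract" checks that could gate promotion of a snapshot.
import pandas as pd


def check_quotes_snapshot(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable contract violations (empty means pass)."""
    failures = []

    required_columns = {"symbol", "timestamp", "price", "size"}
    missing = required_columns - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
        return failures  # No point checking values without the columns.

    if df["symbol"].isna().any():
        failures.append("null symbols present")
    if (df["price"] <= 0).any():
        failures.append("non-positive prices present")
    if df.duplicated(subset=["symbol", "timestamp"]).any():
        failures.append("duplicate (symbol, timestamp) rows present")

    return failures


if __name__ == "__main__":
    snapshot = pd.read_parquet("curated/quotes/")  # Path is illustrative.
    problems = check_quotes_snapshot(snapshot)
    if problems:
        raise SystemExit("quality check failed: " + "; ".join(problems))
    print("snapshot passed its data contract checks")
```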
[00:28:29] Tobias Macey:
And then since the last couple of years, the constant pressure is figuring out the role that AI plays, particularly as more of these agentic workflows become reasonable to implement and have better understanding around how they operate. And I'm wondering how you're thinking about the incorporation of AI utilities both in the creation and curation of your platform and your datasets as well as as an enablement to let your researchers apply AI tools to the data that you are responsible for.
[00:29:04] Effie Baram:
Yeah. I find that, in some ways, anyone right now in the data space has struck gold. These are really exciting times, when the ability to accelerate is like nothing I've seen in prior years. And, again, specifically, I'm speaking about data. I'm sure it's true elsewhere. But one of the things that really hindered our ability to move as fast as we wanted was that you had to really preserve and maintain how the data looked for the rest of the ecosystem. Migrations were there. And I'm sure it's true also for infrastructure and what have you, but now I'm looking at the agentic capabilities.
And in some ways, we have far more opportunities to make operational tasks and reproducible tasks a nonissue. And right there, that opens up an entire area where a data engineer no longer has to worry about the mechanics of operating the plant. They truly can focus on extracting information from the data, which is very nuanced and hard to do, but this is where the time and value is. So the approach that we are taking on this journey is very measured. One, make sure that all the developers, all the users in this space, have experienced what it is to use these technologies in a very modest way, first for their own personal use. So developer productivity, understanding what the boundaries are, understanding the differences between one model or the other, understanding where it's applicable to solve meaningful problems and where you end up chasing rabbit holes. The entire purpose is to really just get your toes wet and understand what the capabilities are. Step two for us is to identify areas that have proper guardrails so that we can really measure the use of this technology in a way that we can feel comfortable trusting it. So, for example, some of our businesses have very standard patterns to create their ETLs, and we use these technologies to basically accelerate that entire journey. Data issues, finding gaps, communicating with vendors, or extracting information from PDFs: for all these areas that traditionally were done by humans, leveraging this technology has proven to be very effective, especially at scale. It's a huge time saver for us. And the goal that we have is that, by covering these two areas, you now have a more sophisticated data engineer who understands what the tool sets and capabilities are, and you also have freed up enough space, by taking on the toil, the real operational aspects of what we do, to now consider the next frontier in the areas that you want to solve for.
The opportunities are obviously there; there are too many to even list here. But from our perspective, personally, from my perspective, the shift in how we operate as engineers is a significant one, and I want to make sure that we do this carefully. Prompting is actually not easy, and you have to spend quite a bit of time making sure that you're giving the right context, making sure you're feeding your model the right data, and making sure that the work is reproducible, because the shift and evolution of this technology is so rapid that I could easily see this becoming a major source of toil in production, if you will, in your environment. So we really have to change how we develop, to assume that things will operate and change very quickly, and fold them in as we move along.
[00:33:28] Tobias Macey:
Another interesting aspect of working at that foundational layer of the organization in terms of the technical stack is that, as we've discussed, everybody is going to have their own ideas about what is the best approach for a particular thing, what is the area that the business should be investing in because, obviously, their idea is the most important or most impactful. And I'm wondering what you see as some of the aspects of the socio technical friction that you have found to be either most frequent or most challenging to address.
[00:34:03] Effie Baram:
I found that if I were to meet my customers' requirements and pace, even if I had infinite resources at my disposal, I don't think that the end result would meet their expectations in the long run. And the reason I say that, again, is the notion of having a platform. There is something to be said for having a platform that provides data with, think of it as a data contract, where it's very well defined what it is my customers are receiving, where I have guarantees on the quality of the data, and where I even give you capabilities to research much faster by providing the data in a certain shape along the way. It also reduces the time my customer will need in order to hook their software into the system that we just created, had we gone down this path. And so one of the things that I have done is to anchor on one or two things and do them really well. Think of it as ice cubes to basically counter the snowflake effect, where everyone wants to have something very different and unique to them. I find that, by and large, if you provide good ice cubes, good patterns, good APIs and contracts for your data, even if it does not meet the requirements of my customers at 100% but only at 90, they will opt to come back into the platform and use it, because it's available today, rather than forking. Because if you have forked, you got off the ground very rapidly, but now you have to build the support function, the life cycle management.
You basically have to take an entire platform journey on this leaf node. And, obviously, oftentimes, it's not front and center when your customer is thinking about it, and so it's not something that I can negotiate up front. Oftentimes, you really have very demanding business needs that you need to meet. But by having a platform that gives everyone what they need by and large, that usually covers the most ground. So this is where anchoring on technologies like BigQuery comes in. The reason I picked it was because I knew that it would be able to meet most of my customers' needs in a relatively rapid time. It also still solves my needs, because I can stay within this platform. So right now, we're looking at the Google console offerings, including Gemini assist, to see if maybe our data analysts themselves can stay within the BigQuery ecosystem and this context without having to leave it. And, again, all this is basically to accelerate. If I'm able to provide 80% of my customers' needs, I think I'm able to reduce some of that friction.
[00:37:09] Tobias Macey:
Another interesting challenge in terms of operating a platform is that the boundaries can become a little bit fuzzy, because people want you to take on more responsibility, or maybe you want to be able to exert more control over particular patterns, versus the other trend, which is that you want to cede control because you don't want to be responsible for as many pieces. And I'm wondering how you think about the definition and evolution of those boundary layers as you gain greater operational capacity or greater comfort with the different workflows and as more workflows become standardized.
[00:37:50] Effie Baram:
Yeah. This definitely comes from reliability engineering; that's what we always did. You had customers on the right who would sell you their amazing solution that they're going to pass on to the reliability team to safekeep, and then we had to basically walk the journey of bringing it up to the standards. And so in some ways, it's no different than software. I definitely find that you have customers that will basically create a proof of concept and expect the platform to just naturally absorb it without any particular cost. I tend to pick teams that are willing to work with us to bridge some of that gap. Again, I think having a team that basically worked offline and now expects us to absorb their component into the system is not always a bad pattern. It's a pattern that could be used for driving innovation.
And so when I make a decision to reabsorb and basically assume another team's innovation, I tend to do that because it also meets some of our engineering requirements. There will be times where it will basically stay outside if it doesn't meet some of the basic requirements. So, for instance, if the data that you're producing doesn't meet the quality requirement, there will be cascading effects on the operations team and on other customers that haven't been considered up front, and we won't be able to take that. And so I basically use judgment when I make those decisions or trade-offs.
I'm a big believer in partnering, in that I really love to find opportunities with other teams, because, usually, when you have two teams that need each other, where I could use their technology and they could use my services, you find that there is a much better outcome at the end of it versus really holding the line on, you know, this is a platform, and unless you meet the platform's requirements, you're out. I just find that that also risks teams forking sideways and basically risks your platform becoming null and void.
[00:40:09] Tobias Macey:
And as you have been responsible for the care and maintenance of such a core piece of the technology stack for the organization, and have worked with the various consumers of that platform to address their use cases, what are some of the most interesting or innovative or unexpected ways that you have either seen your specific technology stack applied, or some interesting ideas that you've seen around the pattern of foundational data?
[00:40:39] Effie Baram:
I think that one of the areas that has started to emerge, given the turnkey technologies available, cloud-first technologies, and the agentic capabilities, is that the focus is really shifting towards data engineers having significantly more domain context around the data. If ten, fifteen years ago, you were expected, as part of your software engineering role, to build the infrastructure, or at least integrate the infrastructure, in order to create the platform, that now is, for the most part, happening in support of your business needs. And so our data analysts and our engineers are expected to have a real, deeper understanding of the intent of the data that they are working with. And in some ways, we are elevating the skills of our engineers to work much closer to our researchers.
So even in research, not everything is just pure innovation. Sometimes you have to do a lot of forecasting or feature extraction. And so these are areas that we can more comfortably step into, augmenting the datasets, enriching them with different datasets from other sectors or markets, if you will. These are areas that traditionally the researcher would do; they would basically own the entire research ecosystem. Now we can sit much closer alongside them. And this is definitely a shift that will make some software engineers thrive and be excited, and others will find very challenging and not appealing, depending on the kind of software engineer you are. Are you a builder? Are you a problem solver in the domain? That will really shape your career aspirations, in one form or another. So that, I think, is one of the biggest shifts that I see.
[00:42:52] Tobias Macey:
In your experience of building this foundational data system and working with the organization to take advantage of those capabilities and manage the team and the requirements around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:10] Effie Baram:
Probably the hardest, most surprising challenge that I found was looking at financial data, in that, in some ways, it has a very well defined contract. Everyone in our industry consumes this data. You know, we all have our wholesalers that we buy from. It follows a very particular shape. But, really, under the hood, once it lands with us, I discovered and realized that how we use the data, how different businesses use the data, requires significantly more domain expertise, and it's a lot more nuanced than I thought going in. One would think that, say, for instance, I buy historical data about IBM, it follows a very particular shape, it lands, and chances are everyone in my industry does the same. Well, maybe not. Maybe I augment the data with additional data having to do with hardware purchases or, you know, new CPU innovations in technology. So all of a sudden, the ability to really understand how to integrate different datasets into what is really very coarse, wholesale data becomes a lot more nuanced, and something that not all of the software engineers had an easy time effectively navigating.
[00:44:43] Tobias Macey:
And as you talk to people in other organizations, as you talk to people who are in peer relationships to you, what are the cases where you would say that the foundational data team approach is the wrong way to address the data needs of a given organization?
[00:45:02] Effie Baram:
What we do, unlike other data engineering organizations, is really spend a considerable amount of time modeling the data. This is where you augment it with other data sources, where you understand both the intent of the data as the world views it, but also how it will get integrated all the way through our systems: how it's ingested, how, say, the ops team clears it, how legal and compliance look at it, how it's treated, how it's researched. So you have so many different customers that have to understand and extract the meaning from the data. And so, in that way, I find that it's a very specific and rich type of space to be in, because the context of what you're delivering matters a great deal. Where it does not fit the same pattern is when we have a very specific ask from a researcher for a particular dataset.
So it's a one-to-one mapping. It's not a wholesale function. It does not necessarily have very complex transformations, and it may not need significant history that's extremely rich and dense. And so in those cases, foundational data is not necessarily the right place to solve for these problems. I think of those types of data requests as more shallow, in that there are many of them, but they're relatively simple: simple transformations, simple downloads, and very few, if any, customers. It might be just one customer at the end. Foundational is very few datasets, extremely sophisticated in what they are actually modeling, with many customers along the workflow that I described earlier.
[00:46:55] Tobias Macey:
And as you continue to invest in and iterate on your data platform and stay abreast of the technological evolution of the ecosystem, what are some of the resources that you find particularly helpful as you plan for successive iterations of your technology stack and your platform architecture, and some of the ways that you're thinking about the role of the foundational data layer as AI starts to subsume more of the technology ecosystem?
[00:47:30] Effie Baram:
I find it very challenging, actually, right now to keep up. And, again, it's because the pace of innovation is unlike what I've seen in the past. It's very exciting. So I spend significant time online reading up and also experimenting a lot myself to sample out this new model, this new LLM that just came out. What are the features? Does it actually meet some of the needs that we have? For technologies that we're sampling, we are trying to carve out as much time as we can for experimentation. But the goal that we're setting, so that it's not completely open ended, is that, at least aspirationally, it has to basically pay for itself at the very least, so that we're not completely spending our time in R&D and not actually producing.
So, a significant amount of time online and outside, in ways that I haven't done as much in the past, because I truly feel that if I were not to be looking at what's going on in the industry for the next six months, the world is going to look quite different six months from now. So that is one area. I also spend a good amount of time with my colleagues. We brainstorm with colleagues, former and current. We created a lot of working groups where we're sharing ideas, and we're effectively federating that research both inside and out, again in ways that I haven't seen before. And I find it extremely helpful, because there will be others who are thinking about the same problems that we're having, solving them in a more innovative way. Perhaps they already solved it. So, for example, there is a surge right now in MCP servers that we stood up, but rather than, like, sending a hundred of them out there through all of our working groups, we created an actual catalog and enumerated what those are and how to use them. Basically, we are almost democratizing that work and helping each other out to basically get us ahead. And, again, it's something that I haven't seen as much, especially in a business context. You're usually sitting in front of a problem, and you're trying to stay as focused on that as you can. This one is a bit of a game changer, in that we're all contributing and consuming at the same time, and it all helps us to actually accelerate innovation.
[00:50:01] Tobias Macey:
Are there any other aspects of the work that you're doing or the ways that you're thinking about the role and the applications of foundational data systems that we didn't discuss yet that you'd like to cover before we close out the show?
[00:50:13] Effie Baram:
I think about the evolution of data from what it was, say, twenty years ago, where it was more of a utility and the outcome of all the software development that one would write. Fast forward to today, and data is the product, by and large. It's front and center. It basically has its own pillar in most engineering organizations. You see other areas in engineering starting to shift aside or take a different shape, whereas data is becoming really the core of what most businesses rely on. And I think these are absolutely exciting times to be in the data space. I've always seen data that way, but now we also have the technologies to truly treat it as code. And with the proliferation of the agentic technologies, you definitely have the opportunity to spend a lot more time deriving information, not just producing data. And that is something that gets me very excited, because, again, ten, fifteen years ago, in order to play in the data space, you had to really carve out a significant amount of the life cycle. That now is shrinking, giving one the opportunity to truly treat data as a living, breathing, evolving, shifting entity that fuels a lot of ideas. And I think the opportunities are limitless.
Very exciting.
[00:51:52] Tobias Macey:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:52:17] Effie Baram:
The biggest gap that I still struggle with is good and pragmatically usable lineage solutions. And the reason I'm calling out lineage as a big gap is that, with the evolution of data and the capabilities that it offers, you can no longer expect that the schema will be the same ten years from now, a year from now, a month from now. The transformations are becoming a lot more sophisticated. The producers, the consumers, that sort of contribution pattern is increasing dramatically. And so managing a complex, meaningful dataset without fully having introspection into the various checkpoints along the way that led to the production of that dataset is becoming effectively like a no-op. So, good lineage systems.
The reason I find that still a gap is that it's almost like back in the day in the operating system world, you had technologies like DTrace, where you needed to have really a PhD in order to fully understand why your server behaved a certain way. The premise was phenomenal, but the implementation really required significant depth. I find that, in some ways, it's not quite as complicated on the lineage side, but we need to be able to hook into both existing and, obviously, living, breathing, already-built datasets so that you are able to really shape the data into future use cases that you can't consider today.
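As a purely illustrative sketch of the lineage introspection she is describing, the snippet below shows the minimal checkpoint metadata a transformation run might record: inputs, output, code version, and run time. It is not tied to any specific lineage product; the dataset and job names are invented.

```python
# Illustrative shape of a per-run lineage record that enables later introspection.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    output_dataset: str            # e.g. "curated.quotes_v3"
    input_datasets: list[str]      # upstream tables/files this run read
    transformation: str            # name or identifier of the job/model
    code_version: str              # git SHA of the transformation code
    run_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def record_lineage(record: LineageRecord) -> None:
    # In practice this would append to a metadata store; here we just print it.
    print(
        f"{record.run_at.isoformat()} {record.transformation} "
        f"({record.code_version}): {record.input_datasets} -> {record.output_dataset}"
    )


record_lineage(
    LineageRecord(
        output_dataset="curated.quotes_v3",
        input_datasets=["raw.vendor_quotes", "reference.symbols"],
        transformation="build_quotes_model",
        code_version="a1b2c3d",
    )
)
```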
[00:53:58] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Two Sigma and your overall approach to building that foundational data team and the platform approach to data systems. I appreciate the time and effort that you're putting into that, and I hope you enjoy the rest of your day.
[00:54:16] Effie Baram:
Thank you so much, Tobias.
[00:54:25] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. DataFold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds today for the details.
Poor quality data keeps you from building best in class AI solutions. It costs you money and wastes precious engineering hours. There is a better way. Core signal's multi source enriched cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files. Over 700 companies work with Core Signal to develop AI solutions in investment, sales, recruitment, and other industries. Go to dataengineeringpodcast.com/coresignal and try Core Signal's self-service platform for free today. Your host is Tobias Macy, and today I'm interviewing Effie Baram about data engineering in the finance sector. So, Effie, can you start by introducing yourself?
[00:01:30] Effie Baram:
Yes. Thanks for having me. My name is Effie, and I've been leading foundational data engineering at Two Sigma for in this role for the last two years and in data engineering for the past four years.
[00:01:45] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:48] Effie Baram:
Yes. That was about, ten years ago. I was actually overseeing reliability engineering at the time. And one of the roles that my team had was to procure and produce, research data from our trading systems. And it was a pretty large dataset at the time. The data ecosystem was a little bit different at the time. SLAs, especially for this dataset, were pretty tight. The quality requirements were very high, but they were no not defined. So you didn't really realize that you were producing garbage at the time. The datasets mostly fail when we ran them because the systems we use to produce them were cron like, so it was a hit or miss and, many times missed, many times for infra related reasons.
And the infrastructure logic together with the business logic were all intertwined. So you pretty much needed a PhD to operate and troubleshoot a relatively complex dataset at the time. So, this is when I was also first considering shifting to using, DAG orchestration systems. At the time, it was Airflow. And right there, that choice alone completely shifted how we were able to, manage these datasets. The separation between the business logic and the infrastructure and having the infrastructure be produced using a DAG orchestration system was the game changer for me. And that was really interesting. I didn't even consider that data was that complex and so delicate all at the same time.
[00:03:31] Tobias Macey:
Absolutely. And so bringing us now to where you are at Two Sigma, I'm wondering before we get too deep into what you're building there, if you can give a bit of an overview about some of the ways that data plays a role in the organization and some of the characteristics of the data that you need to work with
[00:03:52] Effie Baram:
there? Yeah. So Two Sigma, by and large, we we mostly focus on data. Data is at the core of, what we do. We either procure it from, vendors, you know, exchanges, wholesalers, think Reuters, Bloomberg. But we also produce a lot of data, and that's always been the case. We are mostly focused on research. So where you have a lot of businesses where the focus is on the actual, you know, production GA think of it, like, what's running in production. In our case, we spend most of our time understanding data and deriving meaningful, insights from it. And, specifically, in foundational data, think of us as wholesalers of those market foundational datas, which, you know, if you look at different industries, every every industry that has something to do with data would have that problem where you have a core dataset that you rely on. And all of your downstream consumers have certain expectations of that data. So for instance, in medical research, you'd probably have information about your patients, and it has to look a certain way. And you procure it from, you know, definitely not from vendors, but, from different, you know, sectors or sections of your hospital or research departments. So what my team does, we basically build and maintain the infrastructure to procure the datasets, and, we make sure that we deliver the datasets as quickly as needed for the various business needs. Not everything needs to be in the, microseconds.
Sometimes it's minutes. Sometimes it's days. And, again, it depends on the frequency: for high-frequency data, you might need specialized hardware to basically receive the data and transform it, whereas if it's lower frequency, you have the opportunity to actually enhance the dataset and make sure that it conforms to what your customers need.
[00:06:05] Tobias Macey:
Given the nature of the organization and the ways that the data is interacted with, as you mentioned, it's not necessarily what many listeners might be familiar with, where the data that they are responsible for curating is going to be immediately used in some form of production context, whether that's business intelligence or analytics or user-facing features; instead, it's more of a research role. I imagine that maybe the latency tolerance is a little bit higher, but the requirement around quality and accuracy is also going to be higher. And I'm wondering how you think about the areas of focus and the points of criticality in the work that you're doing, given the context in which you're operating.
[00:06:51] Effie Baram:
Yeah. That's a really challenging problem. And the reason it is challenging is because the more accurate and rich your data is, the longer the journey is to make sure that the insights one is expecting are actually there. So, for example, if we consume certain data from a particular vendor, and we have certain expectations for how it should look, and we need a very deep, rich history, say, going back thirty, forty years, that basically means that in order to deliver it to our customers, the journey to even begin that research is a long one, longer than one would have the appetite for. So we had to really figure out a balance whereby we deliver as much of a sample of the data as we can. Think of it more like raw data with a looser schema, but a much lower SLA, so that your end user can at least start looking at the data and saying, is it the right shape? Does it have the right attributes that I might be looking for? Do I need thirty, forty years' worth of history, and does it have to be fine-grained in order for me to even begin, or can I start with maybe just a year's sample? And behind the scenes, you have a lot of considerations that, again, in the past, I never even considered.
Legal considerations, cost, obviously storage. But the final one, which I think gets more complicated as time goes on, is really maintaining your schema as the richness of the data increases to make sure that your downstream customers are not impacted by it. And this is something that, you know, ten, fifteen years ago was much harder, because chances are you were in a database with a very fixed schema that everyone was expecting. Fast forward to today, you have data warehouses where the data can be a little bit looser, and you have multiple customers querying the data in different ways. So there's definitely more innovation, but you also have to get there, and that's where complexity is added.
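As a minimal sketch of landing that kind of raw, loosely shaped sample into a warehouse, here is what a load into BigQuery (one of the warehouses Effie names later in the conversation) can look like with Google's Python client; the bucket, project, and table names are placeholders, and this illustrates the general pattern rather than Two Sigma's setup.

```python
# Sketch: land raw vendor Parquet files into a "raw zone" table, letting the
# files' embedded schema define the table rather than a hand-curated one.
# Assumes GCP credentials are configured and google-cloud-bigquery is installed.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,      # Parquet carries its own schema
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

job = client.load_table_from_uri(
    "gs://example-bucket/raw/vendor_sample/*.parquet",  # hypothetical sample drop
    "example-project.raw_zone.vendor_sample",           # hypothetical raw-zone table
    job_config=job_config,
)
job.result()  # wait for the load to finish

table = client.get_table("example-project.raw_zone.vendor_sample")
print(f"Loaded {table.num_rows} rows with the schema inferred from the files.")
```

Researchers can then query the raw table directly to judge shape, attributes, and history depth before anyone commits to the full curation effort.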
[00:09:00] Tobias Macey:
And digging a bit more into that concept of foundational data engineering, obviously it brings along the connotation that the work that you're doing is required, and the level of reliability that you're responsible for is going to be quite high, because everybody else is building on top of what you are creating. And I'm wondering how that shapes the ways that you think about the technology choices, the ways that you structure the work that you're doing, and the pace of change that you're willing to accept, because of the fact that everybody else is relying on you to be that point of stability.
[00:09:38] Effie Baram:
Yeah. Again, this is proving to be a tremendous challenge, and it will probably remain that way, because now we all have access to a lot more data, over a much longer period of time, at finer granularity, and maybe lower latency. And when you add all that together, the ability to deliver fast at that level of depth really goes right up against the need for much higher quality. So what I've done is shift what we're delivering to our customers. Think three or four years ago, we would spend a significant amount of time upfront to make sure that the data we deliver meets all the production requirements for all our users, and so the journey to get there was simply slow, and the more data we added, the slower it got. This, by the way, is also something that I've observed over the past ten years, where in the past, datasets were a lot more naive, not as complex. Nowadays, the DAGs are extremely complicated, and lineage is extremely important, something that we never really considered before. And so the way we handled this sort of conflicting pattern was to move the data that's needed for research into a looser schema, more into the data warehouse, where the quality and the history are not nearly as rich as what you would expect in production, and to create certain milestones along the way. What that does is give the researcher the opportunity to, one, augment their data. Sometimes research ideas end up dying on the vine, so you don't necessarily make a full commitment if you don't really know that you're gonna go all in, or you even produce a leaner, more naive dataset upfront so you can get it into production faster and enrich it over time. In some ways, I think of it almost like data as code, where the nucleus of your idea, like a proof of concept in software, gets delivered first, and you build upon it over time. So everybody wins in this mode.
[00:12:07] Tobias Macey:
And with that platform mindset of the fact that you are building these systems for other people to be able to do their own work on top of, you're usually working with those end users to figure out what their needs and capabilities are. Because of the fact that you're working with researchers, I imagine that they have at least some relatively high level of technical acumen, to be able to bring in their own tools and define their own workflows, which can also be a complicating factor as somebody responsible for a platform, because they want ultimate flexibility, but you want to be able to enforce some level of controls and standards so that it doesn't turn into a mess for everyone else. And I'm curious how that has posed a challenge in terms of how you think about what are the interfaces and capabilities that you want to empower them to have, and what are some of the ways that you want to either encourage some level of build-your-own platform addendums versus bringing those additional capabilities into the fold of your own control to make them generalized for everyone else.
[00:13:13] Effie Baram:
Yeah. That's absolutely a significant challenge. The reality is that you have to support both. If you wanna go fast, you have to be able to operate in a more agile, looser way, and likely a little bit away from your core platform offering. And if you want to be accurate, that's when you go and bring your innovation back into the platform. In some ways, I actually think it's a very reasonable model, provided that you really box the number of experiments that you have, and you also give enough buffer to bring the experiment back into the platform. And this is easier said than done. We all know that. We tend to immediately move on to the next experiment. That's definitely a pattern that I've seen, and I understand it. Obviously, the business pressures will always be greater than our ability to deliver. But the way that we are balancing the two is that we create a tiger team behind a particular innovation that we want to foster. And the thing that I would personally do from an engineering perspective, when I'm working with the business, is try to find a technology vehicle, or any ideas that we have as engineers, and run those through the business innovation to see if we can also bring those back into the platform. So I'll give you an example. In the past couple of years, we moved to relying on Parquet files, away from other file formats. To do that on a platform basically signs you up for a very lengthy migration process, and usually the business will have no appetite for it, because to them there is absolutely no value in what we see, which is obviously performance, standardization, and extra tooling that is basically turnkey with your other platforms. So for us, it was a no-brainer. What we did was introduce the technology while working with the business on a new idea, and when we saw that we were actually able to get what we wanted, we used that as the pattern to map back into the platform using other projects. So these are some of the strategies that I employ so that the projects are not just all engineering driven, because then they'll immediately be shut down for not having commercial value.
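At the smallest scale, the file-format shift she describes can be as simple as the sketch below, assuming pandas with the pyarrow engine is available; the file paths and column names are made up for illustration.

```python
# Toy example: convert a legacy CSV extract to Parquet so downstream tools
# (warehouses, dbt, pandas) can read it as a columnar, typed format.
import pandas as pd

legacy_path = "prices_2024.csv"        # hypothetical legacy extract
parquet_path = "prices_2024.parquet"   # standardized target format

df = pd.read_csv(legacy_path, parse_dates=["trade_date"])
df.to_parquet(parquet_path, engine="pyarrow", index=False)

# Round-trip check: the Parquet copy preserves dtypes that the CSV could not.
check = pd.read_parquet(parquet_path)
print(check.dtypes)
```

The real cost in a platform migration is not the conversion itself but re-pointing every consumer, which is why she ties the change to a business project rather than running it as a standalone engineering effort.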
[00:15:39] Tobias Macey:
Another challenge that I run into periodically, particularly in the context of data work, is that somebody may build a system that works for their particular application. They have their own set of control flow for how to do the data processing, and they end up landing it in the context of an application database. And so then you see, okay, this is a data engineering requirement; we can do this much more efficiently and more scalably and in a more generalized pattern that allows that data to be reused across more contexts. But then you have to justify the duplicative work of what they've already done to then allow for that data to be used in more use cases, or to be able to standardize on different tool chains. And I'm wondering how you've generally approached the justification of that duplicative work, where somebody has something that is functional but you want to rebuild it in a different way, and then figuring out what is that last mile of the handoff to their operational context to say, okay, I've done all of the work that you were doing, and now here's what the actual interface looks like for you to access the same dataset without you having to completely reengineer your application or the data structures that it's reliant on.
[00:16:53] Effie Baram:
Yeah. It's, again, another very common challenge, and it's not unique to data; it's common across software engineering. I think the value of a line of code to one individual is obviously very, very high, because it solves their problem 100% of the time. Right? But when you really try to map it back to the platform, you now have to consider the ways in which that particular feature is written. So, unfortunately, this is a common pattern. In some ways, it's also a good pattern, because you might actually realize that this detour can be used to shift some of the patterns. But in order to do that, what I would recommend, and what I've done, is partnering early with the teams that are working on a particular feature. Either through collaboration, where we contribute some and they contribute some, we close the gap to make sure they don't veer off, or we have contracts at the end of the project whereby we have some time allowed to make sure that we bring the feature back into the platform.
But, again, this is all under the umbrella of: to deliver something fast for the business, it's very hard to do that while you have this really living, breathing, mature platform that needs to meet everyone else's requirements. The two simply collide. And so being able to weave those experiments back in is the single most important aspect. I think keeping the balance is definitely needed, but the two will coexist.
[00:18:42] Tobias Macey:
So another challenge when you're working at that foundational layer is that you are going to largely be responsible for understanding and implementing any regulatory requirements or controls around the data that you're operating with. And given that you're in the financial sector, I imagine that there are a substantial number of them, and then ensuring that the people who are consuming the data understand the requirements and the reasoning for different security controls or access controls that are in place. And I'm wondering how you think about managing that tension of the regulatory and technical complexity that it brings along with the organizational communication and best practices around how to interact with that controlled dataset.
[00:19:30] Effie Baram:
Yeah. It's a great question. Though I would say, you know, every industry has its own version of constraints, whatever those might be. And in some ways, when you think about software development or problem solving as a whole, I find that operating in a constrained environment breeds more creativity. Because when you're very open ended, there is the opportunity to perhaps think a little bit more simplistically. But when you have guardrails and constraints, you actually have to consider so many additional use cases, again, especially on a living, breathing system. So I personally see regulatory constraints almost as testability of your code. It puts boundaries and an interface on what your data, or the information that you're producing, is expected to contain, and you have those receipts along the way. And so I personally enjoy that, because I find it more challenging and therefore more rewarding.
But, again, what is considered regulatory in our industry would have a different equivalent in other sectors, say, in medicine, right, like HIPAA laws and so on. So you have to consider those just as much as we consider these.
[00:20:47] Tobias Macey:
And in terms of the technical considerations around building this data platform, obviously you want to make sure that the data is accessible, that you have some sort of controls, and that you have reliability. I'm wondering how you think about the selection of which tools to use off the shelf, the customizations that you build, and some of the specific in-house technology that you've invested in to be able to facilitate this platform approach to empowering the organization to use data as its core resource.
[00:21:23] Effie Baram:
It's really interesting to be living, you know, in a time where there's a lot of AI capabilities on the right and a lot of turnkey solutions on the left. When you look back ten, fifteen years ago, if you needed to deliver a data platform, or any platform for that matter, that was more sophisticated than, say, just a storage system, let's assume it was doing some fairly sophisticated things for the business, you needed a significant amount of software development investment. Whether you bought it off the shelf or effectively rolled your own, you needed to invest upfront significantly to build the platform before you even brought in the actual components, be it the data that's flowing through it or the actual business logic that you were writing. Fast forward to today, and a lot of those capabilities are available to you. Maybe not 100%, but I would say 90% of what we could possibly want to do in software engineering, and certainly in data engineering, is now available.
And so my personal philosophy is that investing in building non-differentiated infrastructure is something that you have to consider very carefully before you commit the software development effort, because, one, it takes away from solving the business problem, but the second part is that it requires a significant and continuous investment over time. You will never be in a position where you just call a vendor or simply upgrade your software by getting a new download from your favorite vendor. Here, you actually have to debug the stack and make sure that it really meets your continuous requirements.
So I personally lean very much toward buy over build. That being said, it's not the solution for everything; I also have plenty of build solutions. For example, in the data quality space, back when we were looking at the vendor ecosystem, given our requirements, we really needed to solve it in a different way, so we embarked on a journey to write our own, and it really served us well for that particular purpose. In storage and data warehousing, we used to have more proprietary systems, and we're now moving to data warehouse solutions like BigQuery, Snowflake, you name it, because those come wired with a range of other hooks that you don't have to worry about. So, again, you can put your Parquet files in there, and you can hook it up with dbt, and it comes with a lot of bells and whistles without having to really teach your customers, in my case my researchers or developers, how to use that interface.
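Effie doesn't describe the internals of their in-house data quality tooling, but as a rough illustration of the kind of checks such a system runs before a dataset is published, here is a small, self-contained sketch over a pandas DataFrame; the rules and column names are assumptions, not her team's actual checks.

```python
# Illustrative only: a few declarative quality checks of the kind a
# home-grown data quality layer might run before publishing a dataset.
import pandas as pd


def run_quality_checks(df: pd.DataFrame) -> list[str]:
    """Return human-readable failures; an empty list means the dataset passes."""
    failures = []
    if df.empty:
        failures.append("dataset is empty")
    if df["price"].isna().any():
        failures.append("null prices found")
    if (df["price"] <= 0).any():
        failures.append("non-positive prices found")
    if df.duplicated(subset=["symbol", "trade_date"]).any():
        failures.append("duplicate (symbol, trade_date) rows")
    return failures


if __name__ == "__main__":
    sample = pd.DataFrame(
        {
            "symbol": ["IBM", "IBM"],
            "trade_date": ["2024-01-02", "2024-01-03"],
            "price": [182.1, 183.4],
        }
    )
    problems = run_quality_checks(sample)
    print("OK" if not problems else problems)
```

Whether the checks live in a bespoke framework or in warehouse-native tooling, the important part is that they run automatically and gate what gets promoted downstream.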
[00:24:29] Tobias Macey:
And then as far as the architectural patterns, you mentioned the kind of levels of completeness or levels of curation for the data. You mentioned that you're standardizing around more of these off the shelf warehouse components. And I'm wondering if you can just talk through some of the ways that you think about the architectural substrate and then the design patterns about how you manage the data through the various stages of its life cycle.
[00:24:58] Effie Baram:
Yes. So, again, moving to more common technologies, say a data warehouse, what that allows us to do today is, one, standardize and normalize all the data ingestion and bring in the data in its raw format. When you look back fifteen, twenty years, you had to have a certain shape to your data as you brought it in, and when you had to make changes to the schemas, you had to do that very carefully, for one, but, two, chances are you didn't really have lineage in place to know what had changed, when, and by whom. So proceeding in that mode back then was much more complicated than it is today. Having one single data warehouse where all the data is ingested has accelerated and normalized for us the ability to procure a lot of data from a much wider set of vendor sources without having to worry as much about the things that come much later in the workflow of getting data ready. So that would be one. Then we move on to modeling the data and shaping it. And, again, this is something that in the past we had to proceed with very carefully, because anything that you change might have an adverse impact downstream on customers.
Here, in a platform like BigQuery, you can have multiple versions and views of the work that you're doing. You can checkpoint it. You can hook it up to dbt and actually perform CI and CD. And to me, that's probably one of the most interesting shifts that I see in data, and one of the most exciting, where in the past it was pretty hard to consider your data as code. If you wrote SQL, good luck testing it. Fast forward to now, you have your pandas, you have dbt, you have capabilities to basically ensure that you model your data, or make any transformations or changes to it, while having a record. And now we are able to actually treat the data as code. So we talked about ingestion in its raw format. We now have the capability to have multiple users look at the same data and derive the relevant meaning for them while we are focusing on modeling it. We can then take the data to the next level and start preparing it for simulation.
For that one, performance matters, history matters, quality of the data matters. So we may not do it in our data warehouse, because it may not meet our requirements, but we have the ability to actually extract it. We create snapshots for it, and, again, these are very much standardized so that all of our customers know what to expect and how to wire their experiments onto our datasets. We basically provide them an environment that looks and feels like what one would expect from a research environment. We get the feedback back from them, rehydrate the data in our data warehouse, enrich it, finalize the modeling, and once we have it ready to go, we then promote it into production. And, you know, the production system is probably not nearly as sophisticated, if you will, with all the research capabilities, but the reality is it doesn't need to be, because we're not performing research in production.
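The "data as code" idea she keeps returning to boils down to transformations being plain, version-controlled code with tests that CI can run. Here is a minimal sketch of that shape using pandas, which she mentions; it is an illustration of the practice, not one of her team's dbt models.

```python
# Sketch of "data as code": the transformation is a version-controlled
# function, and a unit test pins its behavior so CI can catch regressions.
import pandas as pd


def daily_returns(prices: pd.DataFrame) -> pd.DataFrame:
    """Model a raw price table into per-symbol daily returns."""
    out = prices.sort_values(["symbol", "trade_date"]).copy()
    out["return"] = out.groupby("symbol")["price"].pct_change()
    return out.dropna(subset=["return"])


def test_daily_returns():
    raw = pd.DataFrame(
        {
            "symbol": ["IBM", "IBM"],
            "trade_date": ["2024-01-02", "2024-01-03"],
            "price": [100.0, 110.0],
        }
    )
    result = daily_returns(raw)
    assert len(result) == 1
    assert abs(result["return"].iloc[0] - 0.10) < 1e-9


if __name__ == "__main__":
    test_daily_returns()
    print("transformation behaves as expected")
```

In a dbt-based setup the same idea shows up as models plus schema tests running in CI; the common thread is that a change to the data's shape leaves a reviewable, testable trail.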
[00:28:29] Tobias Macey:
And then over the last couple of years, the constant pressure has been figuring out the role that AI plays, particularly as more of these agentic workflows become reasonable to implement and we have a better understanding of how they operate. And I'm wondering how you're thinking about the incorporation of AI utilities, both in the creation and curation of your platform and your datasets, as well as an enablement to let your researchers apply AI tools to the data that you are responsible for.
[00:29:04] Effie Baram:
Yeah. In some ways, I feel that anyone right now in the data space has struck gold. These are really exciting times, when the ability to accelerate is like nothing I've seen in prior years. And, again, I'm speaking specifically about data; I'm sure it's true elsewhere. But one of the things that really hindered our ability to move as fast as we wanted was that you had to really preserve and maintain how the data looked for the rest of the ecosystem. Migrations were always there. And I'm sure it's true also for infrastructure and what have you, but now I'm looking at the agentic capabilities.
And in some ways, we have far more opportunities to make operational tasks and reproducible tasks a nonissue. Right there, that opens up an entire area where a data engineer no longer has to worry about the mechanics of operating the plant. They truly can focus on extracting information from the data, which is very nuanced and hard to do, but this is where the time and value is. So the approach that we are taking on this journey is very measured. One, make sure that all the developers, all the users in this space, have experienced what it is to use these technologies, just in a very modest way, first for their own personal use. So developer productivity, understanding what the boundaries are, understanding the differences between one model and another, understanding where it's applicable to solve meaningful problems and where you end up going down rabbit holes. The entire purpose is to really just get your toes wet and understand what the capabilities are. Step two for us is to identify areas that have proper guardrails so that we can really measure the use of this technology in a way that we can feel comfortable trusting it. So, for example, some of our businesses have very standard patterns for creating their ETLs, and we use these technologies to accelerate that entire journey. Data issues, finding gaps, communicating with vendors, extracting information from PDFs: for all these areas that traditionally were done by humans, leveraging this technology has proven to be very effective, especially at scale. It's a huge time saver for us. And the goal is that by covering these two areas, you now have a more sophisticated data engineer who understands what the tool sets and capabilities are, and you have also freed up enough space, by taking on the toil, the real operational aspects of what we do, to now consider the next frontier in the areas that you want to solve for.
The opportunities are obviously there; there are far too many to even list here. But from my perspective, the shift in how we operate as engineers is a significant one, and I wanna make sure that we do this carefully. Prompting is actually not easy, and you have to spend quite a bit of time making sure that you're giving the right context, making sure you're feeding your model the right data, and making sure that the work is reproducible, because the shift and evolution of this technology is so rapid that I could easily see this becoming a major source of toil in your production environment. So we really have to change how we develop, assume that things will operate and change very quickly, and fold them in as we move along.
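As a loose illustration of the PDF-and-vendor-correspondence extraction work Effie mentions above, here is a sketch of prompting a language model to turn unstructured vendor text into a structured record. The `call_model` function is a purely hypothetical stand-in for whatever model API a team actually uses, and the schema is invented; nothing here reflects a real vendor integration.

```python
# Hypothetical sketch: turn an unstructured vendor notice into a structured record.
# `call_model` is a placeholder for a real LLM client call; it returns a canned
# response here so the example is self-contained.
import json


def call_model(prompt: str) -> str:
    """Placeholder for an LLM API call (assumption, not a real client)."""
    return '{"dataset": "eod_prices", "gap_start": "2024-03-01", "gap_end": "2024-03-03"}'


def extract_gap_report(vendor_notice: str) -> dict:
    prompt = (
        "Extract the affected dataset and the gap date range from this vendor "
        "notice. Respond with JSON keys: dataset, gap_start, gap_end.\n\n" + vendor_notice
    )
    return json.loads(call_model(prompt))


if __name__ == "__main__":
    notice = "We identified missing end-of-day prices between 2024-03-01 and 2024-03-03."
    print(extract_gap_report(notice))
```

Her point about guardrails applies directly: output like this still gets validated (schema checks, date sanity, human review for anything unusual) before it drives any downstream action.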
[00:33:28] Tobias Macey:
Another interesting aspect of working at that foundational layer of the organization, in terms of the technical stack, is that, as we've discussed, everybody is going to have their own ideas about what is the best approach for a particular thing and what is the area that the business should be investing in, because, obviously, their idea is the most important or most impactful. And I'm wondering what you see as some of the aspects of the socio-technical friction that you have found to be either most frequent or most challenging to address.
[00:34:03] Effie Baram:
I've found that even if I were able to meet my customers' requirements and pace, if I had infinite resources at my disposal, I don't think the end result would meet their expectations in the long run. And the reason I say that, again, is the notion of having a platform. There is something to be said for a platform that provides data with, think of it as a data contract, where it's very well defined what it is my customers are receiving, where I have guarantees on the quality of the data, and where I even give you capabilities to research much faster by providing the data in a certain shape along the way. It also reduces the time my customer will need in order to hook their software into the system that we just created, had we gone down this path. And so one of the things that I have done is to anchor on one or two things and do them really well. Think of it as ice cubes to basically counter the snowflake effect, where everyone wants to have something very different and unique to them. I find that, by and large, if you provide good ice cubes, good patterns, good APIs and contracts for your data, even if it does not meet my customers' requirements at 100% but only at 90, they will opt to come back into the platform and use it, because it's available today, rather than having forked. Because if you fork, you get off the ground very rapidly, but now you have to build the support function and the life cycle management.
You basically have to take an entire platform journey on this leaf node. And, obviously, oftentimes that's not front and center when your customer is thinking about it, so it's not something that I can negotiate upfront. Oftentimes, you really have very demanding business needs that you need to meet. But by having a platform that gives everyone what they need, by and large, that usually covers the most ground. So this is where anchoring on technologies like BigQuery comes in. The reason I picked it was that I knew it would be able to meet most of my customers' needs relatively rapidly. It also still solves my needs, because I can stay within this platform. So right now we're looking at the Google console offerings, including Gemini assist, to see if maybe our data analysts don't have to leave the BigQuery ecosystem and can stay within this context. And, again, all this to basically accelerate. If I'm able to provide 80% of my customers' needs, I think I'm able to reduce some of that friction.
[00:37:09] Tobias Macey:
Another interesting challenge in terms of operating a platform is that the boundaries can become a little bit fuzzy, because people want you to take on more responsibility, or maybe you want to be able to exert more control over particular patterns, versus the other trend, which is that you want to cede control because you don't want to be responsible for as many pieces. And I'm wondering how you think about the definition and evolution of those boundary layers as you gain greater operational capacity or greater comfort with the different workflows, and as more workflows become standardized.
[00:37:50] Effie Baram:
Yeah. This definitely comes from reliability engineering; that's what we always did. You had customers on the right who would sell you their amazing solution that they're gonna pass on to the reliability team to safekeep, and then we had to basically walk the journey of bringing it up to the standards. So in some ways, it's no different than software. I definitely find that you have customers that will basically create a proof of concept and expect the platform to just naturally absorb it, without any particular cost. I tend to pick teams that are willing to work with us to bridge some of that gap. Again, I think having a team that basically worked offline and now expects us to absorb their component into the system is not always a bad pattern. It's a pattern that can be used for driving innovation.
And so when I make a decision to reabsorb and basically assume another team's innovation, I tend to do that because it also meets some of our engineering requirements. There will be times where it will basically stay outside if it doesn't meet some of the basic requirements. So, for instance, if the data that you're producing doesn't meet the quality requirements, there will be cascading effects on the operations team and on other customers that haven't been considered upfront, and we won't be able to take that. So I basically use judgment when I make those decisions or trade-offs.
I'm big on partnering, in that I really love to find opportunities with other teams, because, usually, when you have two teams that need each other, where I could use their technology and they could use my services, you find that there is a much better outcome at the end of it versus really holding the line on, you know, this is a platform, and unless you meet the platform's requirements, you're out. I just find that that also risks teams forking off sideways and basically risks your platform becoming null and void.
[00:40:09] Tobias Macey:
And as you have been responsible for the care and maintenance of such a core piece of the technology stack for the organization, and worked with the various consumers of that platform to address their use cases, what are some of the most interesting or innovative or unexpected ways that you've seen your specific technology stack applied, or some interesting ideas that you've seen around the pattern of foundational data?
[00:40:39] Effie Baram:
I think one of the shifts that has started to emerge, given the turnkey technologies available, the cloud-first technologies, and the agentic capabilities, is that the focus is really moving towards data engineers having significantly more domain context around the data. If ten, fifteen years ago you were expected, as part of your software engineering role, to build the infrastructure, or at least integrate the infrastructure, in order to create the platform, that now is, for the most part, happening in support of your business needs. And so our data analysts and our engineers are expected to have a real, deeper understanding of the intent of the data that they are working with. In some ways, we are elevating the skills of our engineers to work much closer to our researchers.
So even in research, not everything is just pure innovation. Sometimes you have to do a lot of forecasting or feature extraction. And so these are areas that we can more comfortably step into, augmenting the datasets, enriching them with different datasets from other sectors or markets, if you will. These are areas that traditionally the researcher would handle; they would basically own the entire research ecosystem. Now we can sit much closer alongside them. And this is definitely a shift that some software engineers will thrive on and be excited about, and others will find very challenging and not appealing, depending on the kind of software engineer you are. Are you a builder? Are you a problem solver in the domain? That will really shape your career aspirations in one form or another. So that, I think, is one of the biggest shifts that I see.
[00:42:52] Tobias Macey:
In your experience of building this foundational data system and working with the organization to take advantage of those capabilities and manage the team and the requirements around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:10] Effie Baram:
Probably the hardest and most surprising challenge that I found was looking at financial data, in that, in some ways, it has a very well-defined contract. Everyone in our industry consumes this data. You know, we all have our wholesalers that we buy from. It follows a very particular shape. But, really, under the hood, once it lands with us, I discovered that how we use the data, how different businesses use the data, requires significantly more domain expertise, and it's a lot more nuanced than I thought going in. One would think that, say, for instance, I buy historical data about IBM, it follows a very particular shape, it lands, and chances are everyone in my industry does the same. Well, maybe not. Maybe I augment the data with additional data having to do with hardware purchases or, you know, new CPU innovations in technology. So all of a sudden, the ability to really understand how to integrate different datasets into what is really very coarse, wholesale data becomes a lot more nuanced, and something that not all of the software engineers had an easy time effectively navigating.
[00:44:43] Tobias Macey:
And as you talk to people in other organizations, as you talk to people who are in peer relationships to you, what are the cases where you would say that the foundational data team approach is the wrong way to address the data needs of a given organization?
[00:45:02] Effie Baram:
What we do, unlike other data engineering organizations, is really spend a considerable amount of time modeling the data. This is where you augment it with other data sources, where you understand both the intent of the data as the world views it and how it will get integrated all the way through our systems: how it's ingested, how, say, the ops team clears it, how legal and compliance look at it, how it's treated, how it's researched. So you have so many different customers that have to understand and extract the meaning from the data. In that way, I find that it's a very specific and rich type of space to be in, because the context of what you're delivering matters a great deal. Where it does not fit the same pattern is when we have a very specific ask from a researcher for a particular dataset.
So it's a one-to-one mapping. It's not a wholesale function. It does not necessarily have very complex transformations, and it may not need significant history that's extremely rich and dense. In those cases, foundational data is not necessarily the right place to solve for these problems. I think of those types of data requests as more shallow, in that there are many of them, but they're relatively simple: simple transformations, simple downloads, and very few customers, if more than one at all. It might be just one customer at the end. Foundational is very few datasets, extremely sophisticated in what they are actually modeling, with many customers along the workflow that I described earlier.
[00:46:55] Tobias Macey:
And as you continue to invest in and iterate on your data platform and stay abreast of the technological evolution of the ecosystem, what are some of the resources that you find particularly helpful as you plan for successive iterations of your technology stack and your platform architecture, and some of the ways that you're thinking about the role of the foundational data layer as AI starts to subsume more of the technology ecosystem?
[00:47:30] Effie Baram:
I find it very challenging, actually, right now to keep up. And, again, it's because the pace of innovation is unlike what I've seen in the past. It's very exciting. So I spend significant time online reading up, and also experimenting a lot myself to sample this new model, this new LLM that just came out. What are the features? Does it actually meet some of the needs that we have? For some technologies that we're sampling, we are trying to carve out as much time as we can for experimentation. But the goal that we're setting, so that it's not completely open ended, is that, at least aspirationally, it has to basically pay for itself at the very least, so that we're not completely spending our time in R&D and not actually producing.
So, a significant amount of time online, outside, in ways that I haven't done as much in the past, because I truly feel that if I were not looking at what's going on in the industry for the next six months, the world is gonna look quite different six months from now. So that is one area. I also spend a good amount of time with my colleagues. We brainstorm with colleagues, former and current. We created a lot of working groups where we're sharing ideas, and we're effectively federating that research both inside and out, again in ways that I haven't seen before. And I find it extremely helpful, because there will be others who are thinking about the same problems that we're having and solving them in a more innovative way; perhaps they've already solved it. So, for example, there is a surge right now in MCP servers that we stood up, but rather than sending a hundred of them out there through all of our working groups, we created an actual catalog and enumerated what those are and how to use them. Basically, we are almost democratizing that work and helping each other out to get ahead. And, again, it's something that I haven't seen as much, especially in a business context. You're usually sitting in front of a problem and trying to stay focused on that. This one is a bit of a game changer, where we're all contributing and consuming at the same time, and it all helps us to actually accelerate innovation.
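She doesn't describe the catalog's actual format, but as a minimal sketch of the idea, an internal registry of MCP servers can be as simple as a list of described entries with a keyword search over them; the server names, fields, and endpoints below are invented.

```python
# Invented example of a tiny internal catalog of MCP servers: each entry says
# what the server does, who owns it, and how to reach it.
from dataclasses import dataclass


@dataclass(frozen=True)
class McpServerEntry:
    name: str
    description: str
    owner: str
    endpoint: str


CATALOG = [
    McpServerEntry(
        name="dataset-metadata",
        description="Answers questions about dataset schemas and owners.",
        owner="foundational-data",
        endpoint="mcp://internal/dataset-metadata",  # hypothetical address
    ),
    McpServerEntry(
        name="vendor-status",
        description="Reports known vendor outages and data gaps.",
        owner="data-ops",
        endpoint="mcp://internal/vendor-status",  # hypothetical address
    ),
]


def find_servers(keyword: str) -> list[McpServerEntry]:
    """Simple keyword search over names and descriptions."""
    kw = keyword.lower()
    return [e for e in CATALOG if kw in e.name.lower() or kw in e.description.lower()]


if __name__ == "__main__":
    for entry in find_servers("vendor"):
        print(entry.name, "->", entry.endpoint)
```

The value is less in the data structure than in the shared convention: teams register what they built, and everyone else can discover and reuse it instead of standing up a duplicate.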
[00:50:01] Tobias Macey:
Are there any other aspects of the work that you're doing or the ways that you're thinking about the role and the applications of foundational data systems that we didn't discuss yet that you'd like to cover before we close out the show?
[00:50:13] Effie Baram:
I think about the evolution of data from what it was, say, twenty years ago, where it was more of a utility and the outcome of all the software that one would write. Fast forward to today, and data is the product, by and large. It's front and center. It basically has its own pillar in most engineering organizations. You see other areas in engineering starting to shift aside or take a different shape, whereas data is becoming really the core of what most businesses rely on. And I think these are absolutely exciting times to be in the data space. I've always seen data that way, but now we also have the technologies to truly treat it as code. And with the proliferation of the agentic technologies, you definitely have the opportunity to spend a lot more time deriving information, not just producing data. That is something that gets me very excited, because, again, ten, fifteen years ago, in order to play in the data space, you had to really carve out a significant amount of the life cycle. That is now shrinking, giving one the opportunity to truly treat data as a living, breathing, evolving, shifting entity that fuels a lot of ideas. And I think the opportunities are limitless.
Very exciting.
[00:51:52] Tobias Macey:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:52:17] Effie Baram:
The biggest gap that I still struggle with is good, pragmatically usable lineage solutions. And the reason I'm calling out lineage as a big gap is that with the evolution of data and the capabilities that it offers, you can no longer expect that the schema will be the same ten years from now, a year from now, a month from now. The transformations are becoming a lot more sophisticated. The producers, the consumers, that sort of contribution pattern is increasing dramatically. And so managing a complex, meaningful dataset without having full introspection into the various checkpoints along the way that led to the production of that dataset is becoming effectively like a no-op. So, good lineage systems.
The reason I find that still a gap is that it's almost like back in the day in the operating system world, you had technologies like DTrace, where you needed a PhD in order to fully understand why your server behaved a certain way. The premise was phenomenal, but the implementation really required significant depth. I find that in some ways it's not quite as complicated on the lineage side, but we need to be able to hook into both existing and, obviously, living, breathing, already-built datasets, so that you are able to really shape the data into future use cases that you can't consider today. You know?
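To make the idea of checkpointed lineage concrete, here is a toy model of the kind of record such a system might keep per materialization; it is a sketch of the concept only, not any particular lineage standard or Two Sigma's tooling, and all names in it are invented.

```python
# Toy lineage model: every materialization records its inputs, the code
# version that produced it, and when and by whom it ran, so questions like
# "what changed, when, and by whom" have an answer.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    output_dataset: str
    input_datasets: list[str]
    transformation: str          # e.g. a model name or script path
    code_version: str            # e.g. a git commit SHA
    run_by: str
    run_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


LINEAGE_LOG: list[LineageEvent] = []


def record(event: LineageEvent) -> None:
    LINEAGE_LOG.append(event)


def upstream_of(dataset: str) -> set[str]:
    """Walk the log to find every dataset that feeds into `dataset`."""
    found: set[str] = set()
    frontier = [dataset]
    while frontier:
        current = frontier.pop()
        for ev in LINEAGE_LOG:
            if ev.output_dataset == current:
                new = set(ev.input_datasets) - found
                found |= new
                frontier.extend(new)
    return found


if __name__ == "__main__":
    record(LineageEvent("clean_prices", ["raw_prices"], "models/clean_prices.sql", "abc123", "effie"))
    record(LineageEvent("daily_returns", ["clean_prices"], "models/daily_returns.sql", "abc124", "effie"))
    print(upstream_of("daily_returns"))  # {'clean_prices', 'raw_prices'}
```

The hard part she is pointing at is not the record itself but attaching this kind of introspection retroactively to datasets that already exist and keep evolving.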
[00:53:58] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Two Sigma and your overall approach to building that foundational data team and the platform approach to data systems. I appreciate the time and effort that you're putting into that, and I hope you enjoy the rest of your day.
[00:54:16] Effie Baram:
Thank you so much, Tobias.
[00:54:25] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Effie Baram and Data Engineering
Data's Role at Two Sigma
Balancing Data Quality and Latency
Foundational Data Engineering Challenges
Platform Mindset and User Empowerment
Regulatory Constraints and Data Management
Architectural Patterns and Data Lifecycle
AI's Role in Data Engineering
Socio-Technical Friction in Data Platforms
Innovative Uses of Foundational Data
When Foundational Data Teams Aren't the Answer
Planning for Future Technological Iterations