Summary
In this episode of the Data Engineering Podcast, Effie Baram, a leader in foundational data engineering at Two Sigma, talks about the complexities and innovations in data engineering within the finance sector. She discusses the critical role of data at Two Sigma, balancing data quality with delivery speed, and the socio-technical challenges of building a foundational data platform that supports research and operational needs while maintaining regulatory compliance and data quality. Effie also shares insights into treating data as code, leveraging modern data warehouses, and the evolving role of data engineers in a rapidly changing technological landscape.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
- Your host is Tobias Macey and today I'm interviewing Effie Baram about data engineering in the finance sector
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the role of data in the context of 2Sigma?
- What are some of the key characteristics of the types of data sources that you work with?
- Your role is leading "foundational data engineering" at 2Sigma. Can you unpack that title and how it shapes the ways that you think about what you build?
- How does the concept of "foundational data" influence the ways that the business thinks about the organizational patterns around data?
- Given the regulatory environment around finance, how does that impact the ways that you think about the "what" and "how" of the data that you deliver to data consumers?
- Being the foundational team for data use at 2Sigma, how have you approached the design and architecture of your technical systems?
- How do you think about the boundaries between your responsibilities and the rest of the organization?
- What are the design patterns that you have found most helpful in empowering data consumers to build on top of your work?
- What are some of the elements of sociotechnical friction that have been most challenging to address?
- What are the most interesting, innovative, or unexpected ways that you have seen the ideas around "foundational data" applied in your organization?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with financial data?
- When is a foundational data team the wrong approach?
- What do you have planned for the future of your platform design?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- 2Sigma
- Reliability Engineering
- SLA == Service-Level Agreement
- Airflow
- Parquet File Format
- BigQuery
- Snowflake
- dbt
- Gemini Assist
- MCP == Model Context Protocol
- DTrace
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Poor quality data keeps you from building best-in-class AI solutions. It costs you money and wastes precious engineering hours. There is a better way. Coresignal's multi-source, enriched, cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files. Over 700 companies work with Coresignal to develop AI solutions in investment, sales, recruitment, and other industries. Go to dataengineeringpodcast.com/coresignal and try Coresignal's self-service platform for free today. Your host is Tobias Macey, and today I'm interviewing Effie Baram about data engineering in the finance sector. So, Effie, can you start by introducing yourself?
[00:01:30] Effie Baram:
Yes. Thanks for having me. My name is Effie, and I've been leading foundational data engineering at Two Sigma for the last two years, and I've been in data engineering for the past four years.
[00:01:45] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:48] Effie Baram:
Yes. That was about ten years ago. I was actually overseeing reliability engineering at the time, and one of the roles that my team had was to procure and produce research data from our trading systems. It was a pretty large dataset, and the data ecosystem was a little bit different back then. SLAs, especially for this dataset, were pretty tight. The quality requirements were very high, but they were not defined, so you didn't really realize that you were producing garbage. The datasets mostly failed when we ran them because the systems we used to produce them were cron-like, so it was hit or miss, and many times a miss, often for infra-related reasons.
And the infrastructure logic and the business logic were all intertwined, so you pretty much needed a PhD to operate and troubleshoot a relatively complex dataset. This is also when I first considered shifting to using DAG orchestration systems. At the time, it was Airflow. And right there, that choice alone completely shifted how we were able to manage these datasets. The separation between the business logic and the infrastructure, and having the infrastructure be produced using a DAG orchestration system, was the game changer for me. And that was really interesting. I didn't even consider that data was that complex and so delicate all at the same time.
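To make the separation she describes concrete, here is a minimal, illustrative Airflow 2.x-style DAG in which the business logic lives in plain Python functions and Airflow only owns scheduling, dependencies, and retries. The DAG, task, and function names are hypothetical, not Two Sigma's actual pipelines.

```python
# A minimal sketch of the pattern described above: business logic in plain
# functions, orchestration (scheduling, ordering, retries) left to Airflow.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_trades(**context):
    # Pure business logic: read raw trade records for the run date.
    # In a real pipeline this would pull from the trading system's store.
    run_date = context["ds"]
    print(f"extracting trades for {run_date}")


def build_research_dataset(**context):
    # Transform raw trades into the research dataset; no infra concerns here.
    print("building research dataset")


with DAG(
    dag_id="research_trades_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    extract = PythonOperator(task_id="extract_trades", python_callable=extract_trades)
    build = PythonOperator(task_id="build_research_dataset", python_callable=build_research_dataset)

    # Airflow owns ordering and retries; the functions own the logic.
    extract >> build
```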
[00:03:31] Tobias Macey:
Absolutely. And so bringing us now to where you are at Two Sigma, I'm wondering, before we get too deep into what you're building there, if you can give a bit of an overview about some of the ways that data plays a role in the organization and some of the characteristics of the data that you need to work with there?
[00:03:52] Effie Baram:
Yeah. So at Two Sigma, by and large, we mostly focus on data. Data is at the core of what we do. We either procure it from vendors, you know, exchanges, wholesalers, think Reuters, Bloomberg, but we also produce a lot of data, and that's always been the case. We are mostly focused on research. So where you have a lot of businesses where the focus is on the actual production, the GA, think of it like what's running in production, in our case, we spend most of our time understanding data and deriving meaningful insights from it. And, specifically, in foundational data, think of us as wholesalers of that foundational market data. If you look at different industries, every industry that has something to do with data would have that problem where you have a core dataset that you rely on, and all of your downstream consumers have certain expectations of that data. So for instance, in medical research, you'd probably have information about your patients, and it has to look a certain way, and you procure it, definitely not from vendors, but from different sectors or sections of your hospital or research departments. So what my team does is basically build and maintain the infrastructure to procure the datasets, and we make sure that we deliver the datasets as quickly as needed for the various business needs. Not everything needs to be in the microseconds.
Sometimes it's minutes. Sometimes it's days. And, again, it depends on the frequency. With high-frequency data, you might need specialized hardware to basically procure or receive the data and transform it. Or if it's lower frequency, you would have the opportunity to actually enhance the dataset and make sure that it actually conforms to what your customers need.
[00:06:05] Tobias Macey:
Given the nature of the organization and the ways that the data is interacted with, as you mentioned, it's not necessarily what many listeners might be experienced with where the data that they are responsible for curating is going to be immediately used in some form of production context, whether that's business intelligence or analytics or user facing features, and instead it's more on the research role. I imagine that maybe the latency tolerance is a little bit higher, but the requirement around quality and accuracy is also going to be higher. And I'm wondering how you think about the areas of focus and the points of criticality in the work that you're doing given the context in which you're operating.
[00:06:51] Effie Baram:
Yeah. That's a really challenging problem. And the reason it is challenging is because the more accurate and rich your data is, the longer the journey is to make sure that the insights one is expecting actually get delivered. So, for example, if we consume certain data from a particular vendor and we have certain expectations for how it should look, and we need a very deep, rich history, say, going back thirty, forty years, that basically means that, in order to deliver it to our customers, the journey to even begin that research is a long journey, longer than one would have the appetite for. So we had to really figure out a balance whereby we deliver something more like sample data. Think of it more like raw data with a looser schema, but the SLA is much lower, so that your end user can at least start looking at the data and saying, is it the right shape? Does it have the right attributes that I might be looking for? Do I need thirty, forty years' worth of history, and does it have to be fine-grained in order for me to even begin, or can I start with maybe just a year's sample? And behind the scenes, you have a lot of considerations that, again, in the past, I never even considered.
Legal considerations, cost, obviously, storage. But the final one, which, I think gets more complicated as time goes on, is really maintaining your schema as the richness of the data increases to make sure that your downstream customers are not impacted by it. And this is something where, you know, ten, fifteen years ago was much harder because chances are you were in a database with a very fixed, schema that everyone was expecting. And fast forward to today, you have data warehouses where the data could be a little bit looser, and you have multiple customers querying the data in different ways. So there's definitely more innovation, but you also have to get there. And that's where complexity is added.
[00:09:00] Tobias Macey:
And digging a bit more into that concept of foundational data engineering: obviously, it brings along the connotation that the work that you're doing is required, and the level of reliability that you're responsible for is going to be quite high because everybody else is building on top of what you are creating. And I'm wondering how that shapes the ways that you think about the technology choices, the ways that you structure the work that you're doing, the pace of change that you're willing to accept because of the fact that everybody else is relying on you to be that point of stability.
[00:09:38] Effie Baram:
Yeah. Again, this is proving to be a tremendous challenge and will probably remain that way, because now we all have access to a lot more data over a much longer period of time, at finer granularity, and maybe at lower latency. And so when you add all that together, the ability to both deliver fast and at that level of depth really goes right up against the need for much higher quality. So what I've done is shift what we're delivering to our customers. Think three or four years ago, we would spend a significant amount of time up front to make sure that the data we deliver meets all the production requirements for all our users, and so the journey to get there was simply slow, and the more data we added, the slower it got. This, by the way, is also something that I've observed over the past ten years, where in the past, datasets were a lot more naive, not as complex. Nowadays, the DAGs are extremely complicated. Lineage is extremely important, something that we never really considered before. And so the way we handled this sort of conflicting pattern was to move the data that's needed for research into a looser schema, more into the data warehouse, where the quality and the history are not nearly as rich as what you would expect in production. And what we would do is create certain milestones along the way. What that does is give the researcher the opportunity to, one, augment their data. Sometimes research ideas end up dying on the vine, so you don't necessarily make a full commitment if you don't really know that you're going to go all in, or you even produce a leaner, more naive dataset up front so you can get it into production faster and enrich it over time. In some ways, I think of it almost like data as code, where the nucleus of your idea, think of a proof of concept, gets delivered first, and you build upon it over time. So everybody wins in this mode.
[00:12:07] Tobias Macey:
And with that platform mindset of the fact that you are building these systems for other people to be able to do their own work on top of it, and you're usually working with those end users to figure out what are their needs, what are their capabilities. Because of the fact that you're working with researchers, I imagine that they have at least some relatively high level of technical acumen to be able to bring in their own tools, to find their own workflows, which also can be a complicating factor as somebody responsible for a platform because they want ultimate flexibility, but you want to be able to enforce some level of controls and standards so that it doesn't turn into a mess for everyone else. And I'm curious how that has posed a challenge in terms of how you think about what are the interfaces, what are the capabilities that you want to empower them to have, and what are some of the ways that you want to either encourage some level of build their own platform addendums versus bringing those additional capabilities into the fold of your own control to make them generalized for everyone else?
[00:13:13] Effie Baram:
Yeah. That's absolutely a significant challenge. The reality is that you have to support both. If you want to go fast, you have to be able to operate in a more agile, looser way, and likely a little bit away from your core platform offering. And if you want to go accurate, that's when you bring your innovation back into the platform. And in some ways, I actually think it's a very reasonable model, provided that you really box the number of experiments that you have, and you also give enough buffer to bring the experiment back into the platform. And this is easier said than done. We all know that. We tend to immediately move on to the next experiment. That's definitely a pattern that I've seen, and I understand it. Obviously, the business pressures will always be greater than our ability to deliver. But the way that we balance the two is to create a tiger team behind a particular innovation that we want to foster. And the thing that I would personally do from an engineering perspective, when I'm working with the business, is try to find a technology vehicle, or any ideas that we have as engineers, to run through the business innovation to see if we can also bring those back into the platform. So I'll give you an example. In the past couple of years, we moved to relying on Parquet files, away from other file formats. To do that on a platform basically signs you up for a very lengthy migration process, and, usually, the business will have no appetite for it because, to them, there is absolutely no value in what we see, which is obviously performance, standardization, and extra tooling that is basically turnkey with your other platforms. So for us, it was a no-brainer. And so what we did was introduce the technology while working with the business on a new idea. And when we saw that we were actually able to get what we wanted, we used that as the pattern to map back into the platform using other projects. So these are some of the strategies that I employ so that the projects are not just all engineering driven, because then they'll immediately be shut down for not having commercial value.
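As a rough illustration of the kind of file-format standardization described above, the sketch below converts a legacy CSV extract to partitioned Parquet with pandas and pyarrow. The paths, column names, and partitioning scheme are assumptions for the example, not details from the conversation.

```python
# Convert a legacy CSV extract to columnar, compressed, partitioned Parquet so
# downstream tools (warehouses, dbt models, notebooks) can read it efficiently.
import pandas as pd

# Read a legacy CSV extract (assumed layout: one row per quote).
df = pd.read_csv(
    "legacy_extracts/quotes_2024-06-03.csv",
    parse_dates=["timestamp"],
    dtype={"symbol": "string", "price": "float64", "size": "int64"},
)

# Partition by trade date so readers can prune partitions.
df["trade_date"] = df["timestamp"].dt.date.astype("string")
df.to_parquet(
    "curated/quotes/",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["trade_date"],
    index=False,
)

# Reading back a single partition is cheap and schema-aware.
sample = pd.read_parquet("curated/quotes/", filters=[("trade_date", "=", "2024-06-03")])
print(sample.dtypes)
```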
[00:15:39] Tobias Macey:
Another challenge that I run into periodically, particularly in the context of data work, is that somebody may build a system that works for their particular application. They have their own set of control flow for how to do the data processing, and they end up landing it in the context of an application database. And so then you see, okay, well, this is a data engineering requirement. We can do this much more efficiently and more scalably and in a more generalized pattern that allows that data to be reused across more contexts. But then you have to justify the duplicative work of what they've already done to then allow for that data to be used in more use cases or to be able to standardize on different tool chains. And I'm wondering how you've generally approached the justification of that duplicative work where somebody has something that is functional, but you want to rebuild it in a different way, and then figuring out what is that last mile of the handoff to their operational context to say, okay, I've done all of the work that you were doing, and now here's what the actual interface looks like for you to access the same dataset without you having to completely reengineer your application or the data structures that it's reliant on.
[00:16:53] Effie Baram:
Yeah. It's another very common challenge, and it's not unique to data; it's common across software engineering. I think the value of a line of code to one individual is obviously very, very high because it solves their problem 100% of the time, right? But when you really try to map it back to the platform, you now have to consider the ways in which your particular feature is written. And so, unfortunately, this is a common pattern. In some ways, it's also a good pattern, because you might actually realize that this detour can be used to shift some of the patterns. But in order to do that, what I would recommend, and what I've done, is partnering early with the teams that are working on a particular feature, and either through collaboration, where we contribute some and they contribute some, we close the gap to make sure they don't veer off, or we have contracts at the end of the project whereby we have some time allotted to make sure that we bring the feature back into the platform.
But, again, this is all under the umbrella of delivering something fast for the business. It's very hard to do that while you have this really living, breathing, mature platform that needs to meet everyone else's requirements. The two simply collide. And so being able to fold those experiments back in is the single most important aspect. I think keeping the balance is definitely needed, but the two will coexist.
[00:18:42] Tobias Macey:
So another challenge when you're working at that foundational layer is that you are going to largely be responsible for understanding and implementing any regulatory requirements or controls around the data that you're operating with. And given that you're in the financial sector, I imagine that there are a substantial number of them, and then ensuring that the people who are consuming the data understand the requirements and the reasoning for different security controls or access controls that are in place. And I'm wondering how you think about managing that tension of the regulatory and technical complexity that it brings along with the organizational communication and best practices around how to interact with that controlled dataset.
[00:19:30] Effie Baram:
Yeah. It's a great question. Though I would say, you know, every industry has its own version of constraints, whatever those might be. And in some ways, when you think about software development or problem solving as a whole, I find that operating in a constrained environment breeds more creativity. Because when you're very open ended, there is the opportunity to perhaps think a little bit more simplistically. But when you have guardrails and constraints, you actually have to consider so many additional use cases, again, especially on a living, breathing system. So I personally see regulatory constraints almost as testability of your code. It puts boundaries and an interface on what your data, or the information that you're producing, is expected to contain, and you have those receipts along the way. And so I personally enjoy that, because I find it more challenging and therefore more rewarding.
But, again, what is considered regulatory in our industry would have a different equivalent in other sectors, say, in medicine, right, like HIPAA laws and so on. So you have to consider those just as much as you have these.
[00:20:47] Tobias Macey:
And in terms of the technical considerations around building this data platform, obviously you want to make sure that the data is accessible, that you have some sort of controls, and that you have reliability. I'm wondering how you think about the selection of which tools to use off the shelf, the customizations that you build, and some of the specific in-house technology that you've invested in to be able to facilitate this platform approach to empowering the organization to use data as its core resource.
[00:21:23] Effie Baram:
It's really interesting to be living, you know, in a time where there are a lot of AI capabilities on the right and a lot of turnkey solutions on the left. When you look back ten, fifteen years ago, if you needed to deliver a data platform, or any platform for that matter, that was more sophisticated than, say, just a storage system, let's assume it was doing some pretty complex things for the business, you needed a significant amount of software development investment. Whether you bought it off the shelf or effectively rolled your own, you needed to invest up front significantly to build the platform before you even brought in the actual components, be it the data that's flowing through it or the actual business logic that you were writing. And so fast forward to today, a lot of those capabilities are available to you. Maybe not 100%, but I would say 90% of what we could possibly want to do in software engineering, and certainly in data engineering, is now available.
And so my personal philosophy is that investing in and building nondifferentiated infrastructure is something that you have to consider very carefully before you put forth the software development skills, because, one, it takes away from solving the business problem. But the second part is that it requires a significant and continuous investment over time. You will never be in a position where you call a vendor or simply upgrade your software by getting a new download from your favorite vendor. Here, you actually have to debug the stack and make sure that it really meets your continuous requirements.
So I personally am very much buy versus build. That being said, it's not the solution for everything. I also have plenty of build solutions. For example, in the data quality space, back when we were looking at the vendor ecosystem, given our requirements, we really needed to solve it in a different way. And so we embarked on a journey to write our own, and it really served us well for that particular purpose. In storage and data warehousing, we used to have more proprietary systems. We're now moving to using data warehouse solutions like BigQuery, Snowflake, you name it, because those also come wired with a range of other hooks that you don't have to worry about. So, again, you can put your Parquet files in there, and you can hook it up with dbt, and it comes with a lot of bells and whistles without having to really teach your customers, or, in my case, my researchers or developers, how to use that interface.
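To illustrate the "buy rather than build" ingestion path she mentions, here is a minimal sketch that loads raw Parquet files into a BigQuery table with the official Python client. The project, dataset, table, and bucket names are placeholders, not anything referenced in the episode.

```python
# Load raw Parquet files from object storage into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client(project="my-research-project")

table_id = "my-research-project.raw_market.quotes"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Parquet carries its own schema, so BigQuery infers column names and types.
load_job = client.load_table_from_uri(
    "gs://my-raw-landing-bucket/quotes/trade_date=2024-06-03/*.parquet",
    table_id,
    job_config=job_config,
)
load_job.result()  # Block until the load completes.

table = client.get_table(table_id)
print(f"{table.num_rows} rows now in {table_id}")
```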
[00:24:29] Tobias Macey:
And then as far as the architectural patterns, you mentioned the kind of levels of completeness or levels of curation for the data. You mentioned that you're standardizing around more of these off the shelf warehouse components. And I'm wondering if you can just talk through some of the ways that you think about the architectural substrate and then the design patterns about how you manage the data through the various stages of its life cycle.
[00:24:58] Effie Baram:
Yes. So, again, moving to more common technologies, say, a data warehouse, what that allows us to do today is, one, standardize and normalize all the data ingestion and bring in the data in its raw format. When you look back fifteen, twenty years, you had to have a certain shape to your data as you brought it in. And when you had to make changes to the schemas, you had to do that very carefully, one, but two, chances are you didn't really have lineage in place to know what had changed, when, and by whom. And so proceeding in that mode back then was much more complicated than it is today. So having one single data warehouse where all the data is ingested has accelerated and normalized for us the ability to procure a lot of data from a much wider range of vendor sources without having to worry as much about the things that come much later in the workflow of getting data ready. So that would be one. Then we move on to modeling the data and shaping it. And, again, this is something that, in the past, we had to proceed with very carefully, because anything that you change might have adverse impact downstream to customers.
Here, in a platform like BigQuery, you can have multiple versions and views of the work that you're doing. You can checkpoint it. You can hook it to dbt and actually perform CI and CD. And to me, that's probably one of the most interesting shifts that I see in data, and one of the most exciting, where in the past, it was pretty hard to consider your data as code. If you wrote SQL, good luck testing it. Fast forward to now, you have your pandas, you have dbt, you have capabilities to basically ensure that you model your data or make any transformations or changes to it while having a record. And now we are able to actually treat the data as code. So we talked about ingestion into its raw format. We now have the capability to have multiple users look at the same data and basically derive the relevant meaning for them while we are focusing on modeling it. We can then take the data to the next level and start preparing it for simulation.
For that one, performance does matter, history matters, and the quality of the data matters. So we may not do it in our data warehouse, because it may not meet our requirements, but we have the ability to actually extract it. We create snapshots for it, and, again, these are very much standardized so that all of our customers know what to expect and how to wire their experiments onto our datasets. And we basically provide them an environment that looks and feels like what one would expect from a research environment. We get the feedback back from them, rehydrate the data in our data warehouse, enrich it, and finalize the modeling. And once we have it ready to go, we then promote it into production. And, you know, the production system is probably not nearly as sophisticated, if you will, with all the research capabilities, but the reality is it doesn't need to be, because we're not performing research in production.
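As a small, hypothetical example of the "data as code" idea, checkpointed datasets can be validated by tests that run in CI before a snapshot is promoted. The sketch below uses pandas; the table layout, paths, and rules are invented for illustration rather than taken from the episode.

```python
# Illustrative "data contract" checks that could gate promotion of a snapshot.
import pandas as pd


def check_quotes_snapshot(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable contract violations (empty means pass)."""
    failures = []

    required_columns = {"symbol", "timestamp", "price", "size"}
    missing = required_columns - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
        return failures  # No point checking values without the columns.

    if df["symbol"].isna().any():
        failures.append("null symbols present")
    if (df["price"] <= 0).any():
        failures.append("non-positive prices present")
    if df.duplicated(subset=["symbol", "timestamp"]).any():
        failures.append("duplicate (symbol, timestamp) rows present")

    return failures


if __name__ == "__main__":
    snapshot = pd.read_parquet("curated/quotes/")  # Path is illustrative.
    problems = check_quotes_snapshot(snapshot)
    if problems:
        raise SystemExit("quality check failed: " + "; ".join(problems))
    print("snapshot passed its data contract checks")
```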
[00:28:29] Tobias Macey:
And then since the last couple of years, the constant pressure is figuring out the role that AI plays, particularly as more of these agentic workflows become reasonable to implement and have better understanding around how they operate. And I'm wondering how you're thinking about the incorporation of AI utilities both in the creation and curation of your platform and your datasets as well as as an enablement to let your researchers apply AI tools to the data that you are responsible for.
[00:29:04] Effie Baram:
Yeah. I find that, in some ways, anyone right now in the data space has struck gold. These are really exciting times, when the ability to accelerate is like nothing I've seen in prior years. And, again, specifically, I'm speaking about data. I'm sure it's true elsewhere. But one of the things that really hindered our ability to move as fast as we wanted was that you had to really preserve and maintain how the data looked for the rest of the ecosystem. Migrations were there. And I'm sure it's true also for infrastructure and what have you, but now I'm looking at the agentic capabilities.
And in some ways, we have far more opportunities to make operational tasks and reproducible tasks a nonissue. And right there, that opens up an entire area where a data engineer no longer has to worry about the mechanics of operating the plant. They truly can focus on extracting information from the data, which is very nuanced and hard to do, but this is where the time and value is. So the approach that we are taking on this journey is very measured. One, make sure that all the developers, all the users in this space, have experienced what it is to use these technologies in a very modest way, first for their own personal use. So developer productivity, understanding what the boundaries are, understanding the differences between one model or the other, understanding where it's applicable to solve meaningful problems and where you end up chasing rabbit holes. The entire purpose is to really just get your toes wet and understand what the capabilities are. Step two for us is to identify areas that have proper guardrails so that we can really measure the use of this technology in a way that we can feel comfortable trusting it. So, for example, some of our businesses have very standard patterns to create their ETLs, and we use these technologies to basically accelerate that entire journey. Data issues, finding gaps, communicating with vendors, or extracting information from PDFs: for all these areas that traditionally were done by humans, leveraging this technology has proven to be very effective, especially at scale. It's a huge time saver for us. And the goal that we have is that, by covering these two areas, you now have a more sophisticated data engineer who understands what the tool sets and capabilities are, and you also have freed up enough space, by taking on the toil, the real operational aspects of what we do, to now consider the next frontier in the areas that you want to solve for.
The opportunities are obviously there; there are too many to even list here. But from our perspective, personally, from my perspective, the shift in how we operate as engineers is a significant one, and I want to make sure that we do this carefully. Prompting is actually not easy, and you have to spend quite a bit of time making sure that you're giving the right context, making sure you're feeding your model the right data, and making sure that the work is reproducible, because the shift and evolution of this technology is so rapid that I could easily see this becoming a major source of toil in production, if you will, in your environment. So we really have to change how we develop, to assume that things will operate and change very quickly, and fold them in as we move along.
[00:33:28] Tobias Macey:
Another interesting aspect of working at that foundational layer of the organization in terms of the technical stack is that, as we've discussed, everybody is going to have their own ideas about what is the best approach for a particular thing, what is the area that the business should be investing in because, obviously, their idea is the most important or most impactful. And I'm wondering what you see as some of the aspects of the socio technical friction that you have found to be either most frequent or most challenging to address.
[00:34:03] Effie Baram:
I found that if I were to meet my customers' requirements and pace, even if I had infinite resources at my disposal, I don't think that the end result would meet their expectations in the long run. And the reason I say that, again, is the notion of having a platform. There is something to be said for having a platform that provides data with, think of it as a data contract, where it's very well defined what it is my customers are receiving, where I have guarantees on the quality of the data, and where I even give you capabilities to research much faster by providing the data in a certain shape along the way. It also reduces the time my customer will need in order to hook their software into the system that we just created, had we gone down this path. And so one of the things that I have done is to anchor on one or two things and do them really well. Think of it as ice cubes to basically counter the snowflake effect, where everyone wants to have something very different and unique to them. I find that, by and large, if you provide good ice cubes, good patterns, good APIs and contracts for your data, even if it does not meet the requirements of my customers at 100% but only at 90, they will opt to come back into the platform and use it, because it's available today, rather than forking. Because if you have forked, you got off the ground very rapidly, but now you have to build the support function, the life cycle management.
You basically have to take an entire platform journey on this leaf node. And, obviously, oftentimes, it's not front and center when your customer is thinking about it, and so it's not something that I can negotiate up front. Oftentimes, you really have very demanding business needs that you need to meet. But by having a platform that gives everyone what they need by and large, that usually covers the most ground. So this is where anchoring on technologies like BigQuery comes in. The reason I picked it was because I knew that it would be able to meet most of my customers' needs in a relatively rapid time. It also still solves my needs, because I can stay within this platform. So right now, we're looking at the Google console offerings, including Gemini assist, to see if maybe our data analysts themselves can stay within the BigQuery ecosystem and this context without having to leave it. And, again, all this is basically to accelerate. If I'm able to provide 80% of my customers' needs, I think I'm able to reduce some of that friction.
[00:37:09] Tobias Macey:
Another interesting challenge in terms of operating a platform is that the boundaries can become a little bit fuzzy, because people want you to take on more responsibility, or maybe you want to be able to exert more control over particular patterns, versus the other trend, which is that you want to cede control because you don't want to be responsible for as many pieces. And I'm wondering how you think about the definition and evolution of those boundary layers as you gain greater operational capacity or greater comfort with the different workflows and as more workflows become standardized.
[00:37:50] Effie Baram:
Yeah. This definitely comes from reliability engineering; that's what we always did. You had customers on the right who would sell you their amazing solution that they're going to pass on to the reliability team to safekeep, and then we had to basically walk the journey of bringing it up to the standards. And so in some ways, it's no different than software. I definitely find that you have customers that will basically create a proof of concept and expect the platform to just naturally absorb it without any particular cost. I tend to pick teams that are willing to work with us to bridge some of that gap. Again, I think having a team that basically worked offline and now expects us to absorb their component into the system is not always a bad pattern. It's a pattern that could be used for driving innovation.
And so when I make a decision to reabsorb and basically assume another team's innovation, I tend to do that because it also meets some of our engineering requirements. There will be times where it will basically stay outside if it doesn't meet some of the basic requirements. So, for instance, if the data that you're producing doesn't meet the quality requirement, there will be cascading effects on the operations team and on other customers that haven't been considered up front, and we won't be able to take that. And so I basically use judgment when I make those decisions or trade-offs.
I'm a big believer in partnering, in that I really love to find opportunities with other teams, because, usually, when you have two teams that need each other, where I could use their technology and they could use my services, you find that there is a much better outcome at the end of it versus really holding the line on, you know, this is a platform, and unless you meet the platform's requirements, you're out. I just find that that also risks teams forking sideways and basically risks your platform becoming null and void.
[00:40:09] Tobias Macey:
And as you have been responsible for the care and maintenance of such a core piece of the technology stack for the organization, and have worked with the various consumers of that platform to address their use cases, what are some of the most interesting or innovative or unexpected ways that you have either seen your specific technology stack applied, or some interesting ideas that you've seen around the pattern of foundational data?
[00:40:39] Effie Baram:
I think that one of the areas that has started to emerge, given the turnkey technologies available, cloud-first technologies, and the agentic capabilities, is that the focus is really shifting towards data engineers having significantly more domain context around the data. If ten, fifteen years ago, you were expected, as part of your software engineering role, to build the infrastructure, or at least integrate the infrastructure, in order to create the platform, that now is, for the most part, happening in support of your business needs. And so our data analysts and our engineers are expected to have a real, deeper understanding of the intent of the data that they are working with. And in some ways, we are elevating the skills of our engineers to work much closer to our researchers.
So even in research, not everything is just pure innovation. Sometimes you have to do a lot of forecasting or feature extraction. And so these are areas that we can more comfortably step into, augmenting the datasets, enriching them with different datasets from other sectors or markets, if you will. These are areas that traditionally the researcher would do; they would basically own the entire research ecosystem. Now we can sit much closer alongside them. And this is definitely a shift that will make some software engineers thrive and be excited, and others will find very challenging and not appealing, depending on the kind of software engineer you are. Are you a builder? Are you a problem solver in the domain? That will really shape your career aspirations, in one form or another. So that, I think, is one of the biggest shifts that I see.
[00:42:52] Tobias Macey:
In your experience of building this foundational data system and working with the organization to take advantage of those capabilities and manage the team and the requirements around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:10] Effie Baram:
Probably the hardest, most surprising challenge that I found was looking at financial data, in that, in some ways, it has a very well defined contract. Everyone in our industry consumes this data. You know, we all have our wholesalers that we buy from. It follows a very particular shape. But, really, under the hood, once it lands with us, I discovered and realized that how we use the data, how different businesses use the data, requires significantly more domain expertise, and it's a lot more nuanced than I thought going in. One would think that, say, for instance, I buy historical data about IBM, it follows a very particular shape, it lands, and chances are everyone in my industry does the same. Well, maybe not. Maybe I augment the data with additional data having to do with hardware purchases or, you know, new CPU innovations in technology. So all of a sudden, the ability to really understand how to integrate different datasets into what is really very coarse, wholesale data becomes a lot more nuanced, and something that not all of the software engineers had an easy time effectively navigating.
[00:44:43] Tobias Macey:
And as you talk to people in other organizations, as you talk to people who are in peer relationships to you, what are the cases where you would say that the foundational data team approach is the wrong way to address the data needs of a given organization?
[00:45:02] Effie Baram:
What we do, unlike other data engineering organizations, is really spend a considerable amount of time modeling the data. This is where you augment it with other data sources, where you understand both the intent of the data as the world views it, but also how it will get integrated all the way through our systems: how it's ingested, how, say, the ops team clears it, how legal and compliance look at it, how it's treated, how it's researched. So you have so many different customers that have to understand and extract the meaning from the data. And so, in that way, I find that it's a very specific and rich type of space to be in, because the context of what you're delivering matters a great deal. Where it does not fit the same pattern is when we have a very specific ask from a researcher for a particular dataset.
So it's a one-to-one mapping. It's not a wholesale function. It does not necessarily have very complex transformations, and it may not need significant history that's extremely rich and dense. And so in those cases, foundational data is not necessarily the right place to solve for these problems. I think of those types of data requests as more shallow, in that there are many of them, but they're relatively simple: simple transformations, simple downloads, and very few, if any, customers. It might be just one customer at the end. Foundational is very few datasets, extremely sophisticated in what they are actually modeling, with many customers along the workflow that I described earlier.
[00:46:55] Tobias Macey:
And as you continue to invest in and iterate on your data platform and stay abreast of the technological evolution of the ecosystem, what are some of the resources that you find particularly helpful as you plan for successive iterations of your technology stack and your platform architecture, and some of the ways that you're thinking about the role of the foundational data layer as AI starts to subsume more of the technology ecosystem?
[00:47:30] Effie Baram:
I find it very challenging, actually, right now to keep up. And, again, it's because the pace of innovation is unlike what I've seen in the past. It's very exciting. So I spend significant time online reading up and also experimenting a lot myself to sample out this new model, this new LLM that just came out. What are the features? Does it actually meet some of the needs that we have? For technologies that we're sampling, we are trying to carve out as much time as we can for experimentation. But the goal that we're setting, so that it's not completely open ended, is that, at least aspirationally, it has to basically pay for itself at the very least, so that we're not completely spending our time in R&D and not actually producing.
So, a significant amount of time online and outside, in ways that I haven't done as much in the past, because I truly feel that if I were not to be looking at what's going on in the industry for the next six months, the world is going to look quite different six months from now. So that is one area. I also spend a good amount of time with my colleagues. We brainstorm with colleagues, former and current. We created a lot of working groups where we're sharing ideas, and we're effectively federating that research both inside and out, again in ways that I haven't seen before. And I find it extremely helpful, because there will be others who are thinking about the same problems that we're having, solving them in a more innovative way. Perhaps they already solved it. So, for example, there is a surge right now in MCP servers that we stood up, but rather than, like, sending a hundred of them out there through all of our working groups, we created an actual catalog and enumerated what those are and how to use them. Basically, we are almost democratizing that work and helping each other out to basically get us ahead. And, again, it's something that I haven't seen as much, especially in a business context. You're usually sitting in front of a problem, and you're trying to stay as focused on that as you can. This one is a bit of a game changer, in that we're all contributing and consuming at the same time, and it all helps us to actually accelerate innovation.
[00:50:01] Tobias Macey:
Are there any other aspects of the work that you're doing or the ways that you're thinking about the role and the applications of foundational data systems that we didn't discuss yet that you'd like to cover before we close out the show?
[00:50:13] Effie Baram:
I think about the evolution of data from what it was, say, twenty years ago, where it was more of a utility and the outcome of all the software development that one would write. Fast forward to today, and data is the product, by and large. It's front and center. It basically has its own pillar in most engineering organizations. You see other areas in engineering starting to shift aside or take a different shape, whereas data is becoming really the core of what most businesses rely on. And I think these are absolutely exciting times to be in the data space. I've always seen data that way, but now we also have the technologies to truly treat it as code. And with the proliferation of the agentic technologies, you definitely have the opportunity to spend a lot more time deriving information, not just producing data. And that is something that gets me very excited, because, again, ten, fifteen years ago, in order to play in the data space, you had to really carve out a significant amount of the life cycle. That now is shrinking, giving one the opportunity to truly treat data as a living, breathing, evolving, shifting entity that fuels a lot of ideas. And I think the opportunities are limitless.
Very exciting.
[00:51:52] Tobias Macey:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:52:17] Effie Baram:
The biggest gap that I still struggle with is good and pragmatically usable lineage solutions. And the reason I'm calling out lineage as a big gap is that, with the evolution of data and the capabilities that it offers, you can no longer expect that the schema will be the same ten years from now, a year from now, a month from now. The transformations are becoming a lot more sophisticated. The producers, the consumers, that sort of contribution pattern is increasing dramatically. And so managing a complex, meaningful dataset without fully having introspection into the various checkpoints along the way that led to the production of that dataset is becoming effectively like a no-op. So, good lineage systems.
The reason I find that still a gap is that it's almost like back in the day in the operating system world, you had technologies like DTrace, where you needed to have really a PhD in order to fully understand why your server behaved a certain way. The premise was phenomenal, but the implementation really required significant depth. I find that, in some ways, it's not quite as complicated on the lineage side, but we need to be able to hook into both existing and, obviously, living, breathing, already-built datasets so that you are able to really shape the data into future use cases that you can't consider today.
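As a purely illustrative sketch of the lineage introspection she is describing, the snippet below shows the minimal checkpoint metadata a transformation run might record: inputs, output, code version, and run time. It is not tied to any specific lineage product; the dataset and job names are invented.

```python
# Illustrative shape of a per-run lineage record that enables later introspection.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    output_dataset: str            # e.g. "curated.quotes_v3"
    input_datasets: list[str]      # upstream tables/files this run read
    transformation: str            # name or identifier of the job/model
    code_version: str              # git SHA of the transformation code
    run_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def record_lineage(record: LineageRecord) -> None:
    # In practice this would append to a metadata store; here we just print it.
    print(
        f"{record.run_at.isoformat()} {record.transformation} "
        f"({record.code_version}): {record.input_datasets} -> {record.output_dataset}"
    )


record_lineage(
    LineageRecord(
        output_dataset="curated.quotes_v3",
        input_datasets=["raw.vendor_quotes", "reference.symbols"],
        transformation="build_quotes_model",
        code_version="a1b2c3d",
    )
)
```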
[00:53:58] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Two Sigma and your overall approach to building that foundational data team and the platform approach to data systems. I appreciate the time and effort that you're putting into that, and I hope you enjoy the rest of your day.
[00:54:16] Effie Baram:
Thank you so much, Tobias.
[00:54:25] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. DataFold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds today for the details.
Poor quality data keeps you from building best in class AI solutions. It costs you money and wastes precious engineering hours. There is a better way. Core signal's multi source enriched cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files. Over 700 companies work with Core Signal to develop AI solutions in investment, sales, recruitment, and other industries. Go to dataengineeringpodcast.com/coresignal and try Core Signal's self-service platform for free today. Your host is Tobias Macy, and today I'm interviewing Effie Baram about data engineering in the finance sector. So, Effie, can you start by introducing yourself?
[00:01:30] Effie Baram:
Yes. Thanks for having me. My name is Effie, and I've been leading foundational data engineering at Two Sigma for in this role for the last two years and in data engineering for the past four years.
[00:01:45] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:48] Effie Baram:
Yes. That was about, ten years ago. I was actually overseeing reliability engineering at the time. And one of the roles that my team had was to procure and produce, research data from our trading systems. And it was a pretty large dataset at the time. The data ecosystem was a little bit different at the time. SLAs, especially for this dataset, were pretty tight. The quality requirements were very high, but they were no not defined. So you didn't really realize that you were producing garbage at the time. The datasets mostly fail when we ran them because the systems we use to produce them were cron like, so it was a hit or miss and, many times missed, many times for infra related reasons.
And the infrastructure logic together with the business logic were all intertwined. So you pretty much needed a PhD to operate and troubleshoot a relatively complex dataset at the time. So, this is when I was also first considering shifting to using, DAG orchestration systems. At the time, it was Airflow. And right there, that choice alone completely shifted how we were able to, manage these datasets. The separation between the business logic and the infrastructure and having the infrastructure be produced using a DAG orchestration system was the game changer for me. And that was really interesting. I didn't even consider that data was that complex and so delicate all at the same time.
[00:03:31] Tobias Macey:
Absolutely. And so bringing us now to where you are at Two Sigma, I'm wondering before we get too deep into what you're building there, if you can give a bit of an overview about some of the ways that data plays a role in the organization and some of the characteristics of the data that you need to work with
[00:03:52] Effie Baram:
there? Yeah. So Two Sigma, by and large, we we mostly focus on data. Data is at the core of, what we do. We either procure it from, vendors, you know, exchanges, wholesalers, think Reuters, Bloomberg. But we also produce a lot of data, and that's always been the case. We are mostly focused on research. So where you have a lot of businesses where the focus is on the actual, you know, production GA think of it, like, what's running in production. In our case, we spend most of our time understanding data and deriving meaningful, insights from it. And, specifically, in foundational data, think of us as wholesalers of those market foundational datas, which, you know, if you look at different industries, every every industry that has something to do with data would have that problem where you have a core dataset that you rely on. And all of your downstream consumers have certain expectations of that data. So for instance, in medical research, you'd probably have information about your patients, and it has to look a certain way. And you procure it from, you know, definitely not from vendors, but, from different, you know, sectors or sections of your hospital or research departments. So what my team does, we basically build and maintain the infrastructure to procure the datasets, and, we make sure that we deliver the datasets as quickly as needed for the various business needs. Not everything needs to be in the, microseconds.
Sometimes it's minutes. Sometimes it's days. And, again, it depends on the frequency: for high-frequency data, you might need specialized hardware to basically receive the data and transform it, whereas if it's lower frequency, you have the opportunity to actually enhance the dataset and make sure that it conforms to what your customers need.
[00:06:05] Tobias Macey:
Given the nature of the organization and the ways that the data is interacted with, as you mentioned, it's not necessarily what many listeners might be familiar with, where the data that they are responsible for curating is going to be immediately used in some form of production context, whether that's business intelligence or analytics or user-facing features; instead, it's more of a research role. I imagine that maybe the latency tolerance is a little bit higher, but the requirement around quality and accuracy is also going to be higher. And I'm wondering how you think about the areas of focus and the points of criticality in the work that you're doing, given the context in which you're operating.
[00:06:51] Effie Baram:
Yeah. That's a really challenging problem. And the reason it is challenging is because the more accurate and rich your data is, the longer the journey is to make sure that the insights one is expecting are actually there. So, for example, if we consume certain data from a particular vendor, and we have certain expectations for how it should look, and we need a very deep, rich history, say, going back thirty, forty years, that basically means that in order to deliver it to our customers, the journey to even begin that research is a long one, longer than one would have the appetite for. So we had to really figure out a balance whereby we deliver as much of a sample of the data as we can. Think of it more like raw data with a looser schema, but a much lower SLA, so that your end user can at least start looking at the data and saying, is it the right shape? Does it have the right attributes that I might be looking for? Do I need thirty, forty years' worth of history, and does it have to be fine-grained in order for me to even begin, or can I start with maybe just a year's sample? And behind the scenes, you have a lot of considerations that, again, in the past, I never even considered.
Legal considerations, cost, obviously storage. But the final one, which I think gets more complicated as time goes on, is really maintaining your schema as the richness of the data increases to make sure that your downstream customers are not impacted by it. And this is something that, you know, ten, fifteen years ago was much harder, because chances are you were in a database with a very fixed schema that everyone was expecting. Fast forward to today, you have data warehouses where the data can be a little bit looser, and you have multiple customers querying the data in different ways. So there's definitely more innovation, but you also have to get there, and that's where complexity is added.
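As a minimal sketch of landing that kind of raw, loosely shaped sample into a warehouse, here is what a load into BigQuery (one of the warehouses Effie names later in the conversation) can look like with Google's Python client; the bucket, project, and table names are placeholders, and this illustrates the general pattern rather than Two Sigma's setup.

```python
# Sketch: land raw vendor Parquet files into a "raw zone" table, letting the
# files' embedded schema define the table rather than a hand-curated one.
# Assumes GCP credentials are configured and google-cloud-bigquery is installed.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,      # Parquet carries its own schema
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

job = client.load_table_from_uri(
    "gs://example-bucket/raw/vendor_sample/*.parquet",  # hypothetical sample drop
    "example-project.raw_zone.vendor_sample",           # hypothetical raw-zone table
    job_config=job_config,
)
job.result()  # wait for the load to finish

table = client.get_table("example-project.raw_zone.vendor_sample")
print(f"Loaded {table.num_rows} rows with the schema inferred from the files.")
```

Researchers can then query the raw table directly to judge shape, attributes, and history depth before anyone commits to the full curation effort.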
[00:09:00] Tobias Macey:
And digging a bit more into that concept of foundational data engineering, obviously it brings along the connotation that the work that you're doing is required, and the level of reliability that you're responsible for is going to be quite high, because everybody else is building on top of what you are creating. And I'm wondering how that shapes the ways that you think about the technology choices, the ways that you structure the work that you're doing, and the pace of change that you're willing to accept, because of the fact that everybody else is relying on you to be that point of stability.
[00:09:38] Effie Baram:
Yeah. Again, this is proving to be a tremendous challenge, and it will probably remain that way, because now we all have access to a lot more data, over a much longer period of time, at finer granularity, and maybe lower latency. And when you add all that together, the ability to deliver fast at that level of depth really goes right up against the need for much higher quality. So what I've done is shift what we're delivering to our customers. Think three or four years ago, we would spend a significant amount of time upfront to make sure that the data we deliver meets all the production requirements for all our users, and so the journey to get there was simply slow, and the more data we added, the slower it got. This, by the way, is also something that I've observed over the past ten years, where in the past, datasets were a lot more naive, not as complex. Nowadays, the DAGs are extremely complicated, and lineage is extremely important, something that we never really considered before. And so the way we handled this sort of conflicting pattern was to move the data that's needed for research into a looser schema, more into the data warehouse, where the quality and the history are not nearly as rich as what you would expect in production, and to create certain milestones along the way. What that does is give the researcher the opportunity to, one, augment their data. Sometimes research ideas end up dying on the vine, so you don't necessarily make a full commitment if you don't really know that you're gonna go all in, or you even produce a leaner, more naive dataset upfront so you can get it into production faster and enrich it over time. In some ways, I think of it almost like data as code, where the nucleus of your idea, like a proof of concept in software, gets delivered first, and you build upon it over time. So everybody wins in this mode.
[00:12:07] Tobias Macey:
And with that platform mindset of the fact that you are building these systems for other people to be able to do their own work on top of, you're usually working with those end users to figure out what their needs and capabilities are. Because of the fact that you're working with researchers, I imagine that they have at least some relatively high level of technical acumen, to be able to bring in their own tools and define their own workflows, which can also be a complicating factor as somebody responsible for a platform, because they want ultimate flexibility, but you want to be able to enforce some level of controls and standards so that it doesn't turn into a mess for everyone else. And I'm curious how that has posed a challenge in terms of how you think about what are the interfaces and capabilities that you want to empower them to have, and what are some of the ways that you want to either encourage some level of build-your-own platform addendums versus bringing those additional capabilities into the fold of your own control to make them generalized for everyone else.
[00:13:13] Effie Baram:
Yeah. That's absolutely a significant challenge. The reality is that you have to support both. If you wanna go fast, you have to be able to operate in a more agile, looser way, and likely a little bit away from your core platform offering. And if you want to be accurate, that's when you go and bring your innovation back into the platform. In some ways, I actually think it's a very reasonable model, provided that you really box the number of experiments that you have, and you also give enough buffer to bring the experiment back into the platform. And this is easier said than done. We all know that. We tend to immediately move on to the next experiment. That's definitely a pattern that I've seen, and I understand it. Obviously, the business pressures will always be greater than our ability to deliver. But the way that we are balancing the two is that we create a tiger team behind a particular innovation that we want to foster. And the thing that I would personally do from an engineering perspective, when I'm working with the business, is try to find a technology vehicle, or any ideas that we have as engineers, and run those through the business innovation to see if we can also bring those back into the platform. So I'll give you an example. In the past couple of years, we moved to relying on Parquet files, away from other file formats. To do that on a platform basically signs you up for a very lengthy migration process, and usually the business will have no appetite for it, because to them there is absolutely no value in what we see, which is obviously performance, standardization, and extra tooling that is basically turnkey with your other platforms. So for us, it was a no-brainer. What we did was introduce the technology while working with the business on a new idea, and when we saw that we were actually able to get what we wanted, we used that as the pattern to map back into the platform using other projects. So these are some of the strategies that I employ so that the projects are not just all engineering driven, because then they'll immediately be shut down for not having commercial value.
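At the smallest scale, the file-format shift she describes can be as simple as the sketch below, assuming pandas with the pyarrow engine is available; the file paths and column names are made up for illustration.

```python
# Toy example: convert a legacy CSV extract to Parquet so downstream tools
# (warehouses, dbt, pandas) can read it as a columnar, typed format.
import pandas as pd

legacy_path = "prices_2024.csv"        # hypothetical legacy extract
parquet_path = "prices_2024.parquet"   # standardized target format

df = pd.read_csv(legacy_path, parse_dates=["trade_date"])
df.to_parquet(parquet_path, engine="pyarrow", index=False)

# Round-trip check: the Parquet copy preserves dtypes that the CSV could not.
check = pd.read_parquet(parquet_path)
print(check.dtypes)
```

The real cost in a platform migration is not the conversion itself but re-pointing every consumer, which is why she ties the change to a business project rather than running it as a standalone engineering effort.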
[00:15:39] Tobias Macey:
Another challenge that I run into periodically, particularly in the context of data work, is that somebody may build a system that works for their particular application. They have their own set of control flow for how to do the data processing, and they end up landing it in the context of an application database. And so then you see, okay, this is a data engineering requirement; we can do this much more efficiently and more scalably and in a more generalized pattern that allows that data to be reused across more contexts. But then you have to justify the duplicative work of what they've already done to then allow for that data to be used in more use cases, or to be able to standardize on different tool chains. And I'm wondering how you've generally approached the justification of that duplicative work, where somebody has something that is functional but you want to rebuild it in a different way, and then figuring out what is that last mile of the handoff to their operational context to say, okay, I've done all of the work that you were doing, and now here's what the actual interface looks like for you to access the same dataset without you having to completely reengineer your application or the data structures that it's reliant on.
[00:16:53] Effie Baram:
Yeah. It's, again, another very common challenge, and it's not unique to data; it's common across software engineering. I think the value of a line of code to one individual is obviously very, very high, because it solves their problem 100% of the time. Right? But when you really try to map it back to the platform, you now have to consider the ways in which that particular feature is written. So, unfortunately, this is a common pattern. In some ways, it's also a good pattern, because you might actually realize that this detour can be used to shift some of the patterns. But in order to do that, what I would recommend, and what I've done, is partnering early with the teams that are working on a particular feature. Either through collaboration, where we contribute some and they contribute some, we close the gap to make sure they don't veer off, or we have contracts at the end of the project whereby we have some time allowed to make sure that we bring the feature back into the platform.
But, again, this is all under the umbrella of: to deliver something fast for the business, it's very hard to do that while you have this really living, breathing, mature platform that needs to meet everyone else's requirements. The two simply collide. And so being able to weave those experiments back in is the single most important aspect. I think keeping the balance is definitely needed, but the two will coexist.
[00:18:42] Tobias Macey:
So another challenge when you're working at that foundational layer is that you are going to largely be responsible for understanding and implementing any regulatory requirements or controls around the data that you're operating with. And given that you're in the financial sector, I imagine that there are a substantial number of them, and then ensuring that the people who are consuming the data understand the requirements and the reasoning for different security controls or access controls that are in place. And I'm wondering how you think about managing that tension of the regulatory and technical complexity that it brings along with the organizational communication and best practices around how to interact with that controlled dataset.
[00:19:30] Effie Baram:
Yeah. It's a great question. Though I would say, you know, every industry has its own version of constraints, whatever those might be. And in some ways, when you think about software development or problem solving as a whole, I find that operating in a constrained environment breeds more creativity. Because when you're very open ended, there is the opportunity to perhaps think a little bit more simplistically. But when you have guardrails and constraints, you actually have to consider so many additional use cases, again, especially on a living, breathing system. So I personally see regulatory constraints almost as testability of your code. It puts boundaries and an interface on what your data, or the information that you're producing, is expected to contain, and you have those receipts along the way. And so I personally enjoy that, because I find it more challenging and therefore more rewarding.
But, again, what is considered regulatory in our industry would have a different equivalent in other sectors, say, in medicine, right, like HIPAA laws and so on. So you have to consider those just as much as we consider these.
[00:20:47] Tobias Macey:
And in terms of the technical considerations around building this data platform, obviously you want to make sure that the data is accessible, that you have some sort of controls, and that you have reliability. I'm wondering how you think about the selection of which tools to use off the shelf, the customizations that you build, and some of the specific in-house technology that you've invested in to be able to facilitate this platform approach to empowering the organization to use data as its core resource.
[00:21:23] Effie Baram:
It's really interesting to be living, you know, in a time where there's a lot of AI capabilities on the right and a lot of turnkey solutions on the left. When you look back ten, fifteen years ago, if you needed to deliver a data platform, or any platform for that matter, that was more sophisticated than, say, just a storage system, let's assume it was doing some fairly sophisticated things for the business, you needed a significant amount of software development investment. Whether you bought it off the shelf or effectively rolled your own, you needed to invest upfront significantly to build the platform before you even brought in the actual components, be it the data that's flowing through it or the actual business logic that you were writing. Fast forward to today, and a lot of those capabilities are available to you. Maybe not 100%, but I would say 90% of what we could possibly want to do in software engineering, and certainly in data engineering, is now available.
And so my personal philosophy is that investing in building non-differentiated infrastructure is something that you have to consider very carefully before you commit the software development effort, because, one, it takes away from solving the business problem, but the second part is that it requires a significant and continuous investment over time. You will never be in a position where you just call a vendor or simply upgrade your software by getting a new download from your favorite vendor. Here, you actually have to debug the stack and make sure that it really meets your continuous requirements.
So I personally lean very much toward buy over build. That being said, it's not the solution for everything; I also have plenty of build solutions. For example, in the data quality space, back when we were looking at the vendor ecosystem, given our requirements, we really needed to solve it in a different way, so we embarked on a journey to write our own, and it really served us well for that particular purpose. In storage and data warehousing, we used to have more proprietary systems, and we're now moving to data warehouse solutions like BigQuery, Snowflake, you name it, because those come wired with a range of other hooks that you don't have to worry about. So, again, you can put your Parquet files in there, and you can hook it up with dbt, and it comes with a lot of bells and whistles without having to really teach your customers, in my case my researchers or developers, how to use that interface.
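Effie doesn't describe the internals of their in-house data quality tooling, but as a rough illustration of the kind of checks such a system runs before a dataset is published, here is a small, self-contained sketch over a pandas DataFrame; the rules and column names are assumptions, not her team's actual checks.

```python
# Illustrative only: a few declarative quality checks of the kind a
# home-grown data quality layer might run before publishing a dataset.
import pandas as pd


def run_quality_checks(df: pd.DataFrame) -> list[str]:
    """Return human-readable failures; an empty list means the dataset passes."""
    failures = []
    if df.empty:
        failures.append("dataset is empty")
    if df["price"].isna().any():
        failures.append("null prices found")
    if (df["price"] <= 0).any():
        failures.append("non-positive prices found")
    if df.duplicated(subset=["symbol", "trade_date"]).any():
        failures.append("duplicate (symbol, trade_date) rows")
    return failures


if __name__ == "__main__":
    sample = pd.DataFrame(
        {
            "symbol": ["IBM", "IBM"],
            "trade_date": ["2024-01-02", "2024-01-03"],
            "price": [182.1, 183.4],
        }
    )
    problems = run_quality_checks(sample)
    print("OK" if not problems else problems)
```

Whether the checks live in a bespoke framework or in warehouse-native tooling, the important part is that they run automatically and gate what gets promoted downstream.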
[00:24:29] Tobias Macey:
And then as far as the architectural patterns, you mentioned the kind of levels of completeness or levels of curation for the data. You mentioned that you're standardizing around more of these off the shelf warehouse components. And I'm wondering if you can just talk through some of the ways that you think about the architectural substrate and then the design patterns about how you manage the data through the various stages of its life cycle.
[00:24:58] Effie Baram:
Yes. So, again, moving to more common technologies, say a data warehouse, what that allows us to do today is, one, standardize and normalize all the data ingestion and bring in the data in its raw format. When you look back fifteen, twenty years, you had to have a certain shape to your data as you brought it in, and when you had to make changes to the schemas, you had to do that very carefully, for one, but, two, chances are you didn't really have lineage in place to know what had changed, when, and by whom. So proceeding in that mode back then was much more complicated than it is today. Having one single data warehouse where all the data is ingested has accelerated and normalized for us the ability to procure a lot of data from a much wider set of vendor sources without having to worry as much about the things that come much later in the workflow of getting data ready. So that would be one. Then we move on to modeling the data and shaping it. And, again, this is something that in the past we had to proceed with very carefully, because anything that you change might have an adverse impact downstream on customers.
Here, in a platform like BigQuery, you can have multiple versions and views of the work that you're doing. You can checkpoint it. You can hook it up to dbt and actually perform CI and CD. And to me, that's probably one of the most interesting shifts that I see in data, and one of the most exciting, where in the past it was pretty hard to consider your data as code. If you wrote SQL, good luck testing it. Fast forward to now, you have your pandas, you have dbt, you have capabilities to basically ensure that you model your data, or make any transformations or changes to it, while having a record. And now we are able to actually treat the data as code. So we talked about ingestion in its raw format. We now have the capability to have multiple users look at the same data and derive the relevant meaning for them while we are focusing on modeling it. We can then take the data to the next level and start preparing it for simulation.
For that one, performance matters, history matters, quality of the data matters. So we may not do it in our data warehouse, because it may not meet our requirements, but we have the ability to actually extract it. We create snapshots for it, and, again, these are very much standardized so that all of our customers know what to expect and how to wire their experiments onto our datasets. We basically provide them an environment that looks and feels like what one would expect from a research environment. We get the feedback back from them, rehydrate the data in our data warehouse, enrich it, finalize the modeling, and once we have it ready to go, we then promote it into production. And, you know, the production system is probably not nearly as sophisticated, if you will, with all the research capabilities, but the reality is it doesn't need to be, because we're not performing research in production.
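The "data as code" idea she keeps returning to boils down to transformations being plain, version-controlled code with tests that CI can run. Here is a minimal sketch of that shape using pandas, which she mentions; it is an illustration of the practice, not one of her team's dbt models.

```python
# Sketch of "data as code": the transformation is a version-controlled
# function, and a unit test pins its behavior so CI can catch regressions.
import pandas as pd


def daily_returns(prices: pd.DataFrame) -> pd.DataFrame:
    """Model a raw price table into per-symbol daily returns."""
    out = prices.sort_values(["symbol", "trade_date"]).copy()
    out["return"] = out.groupby("symbol")["price"].pct_change()
    return out.dropna(subset=["return"])


def test_daily_returns():
    raw = pd.DataFrame(
        {
            "symbol": ["IBM", "IBM"],
            "trade_date": ["2024-01-02", "2024-01-03"],
            "price": [100.0, 110.0],
        }
    )
    result = daily_returns(raw)
    assert len(result) == 1
    assert abs(result["return"].iloc[0] - 0.10) < 1e-9


if __name__ == "__main__":
    test_daily_returns()
    print("transformation behaves as expected")
```

In a dbt-based setup the same idea shows up as models plus schema tests running in CI; the common thread is that a change to the data's shape leaves a reviewable, testable trail.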
[00:28:29] Tobias Macey:
And then over the last couple of years, the constant pressure has been figuring out the role that AI plays, particularly as more of these agentic workflows become reasonable to implement and we have a better understanding of how they operate. And I'm wondering how you're thinking about the incorporation of AI utilities, both in the creation and curation of your platform and your datasets, as well as an enablement to let your researchers apply AI tools to the data that you are responsible for.
[00:29:04] Effie Baram:
Yeah. In some ways, I feel that anyone right now in the data space has struck gold. These are really exciting times, when the ability to accelerate is like nothing I've seen in prior years. And, again, I'm speaking specifically about data; I'm sure it's true elsewhere. But one of the things that really hindered our ability to move as fast as we wanted was that you had to really preserve and maintain how the data looked for the rest of the ecosystem. Migrations were always there. And I'm sure it's true also for infrastructure and what have you, but now I'm looking at the agentic capabilities.
And in some ways, we have far more opportunities to make operational tasks and reproducible tasks a nonissue. Right there, that opens up an entire area where a data engineer no longer has to worry about the mechanics of operating the plant. They truly can focus on extracting information from the data, which is very nuanced and hard to do, but this is where the time and value is. So the approach that we are taking on this journey is very measured. One, make sure that all the developers, all the users in this space, have experienced what it is to use these technologies, just in a very modest way, first for their own personal use. So developer productivity, understanding what the boundaries are, understanding the differences between one model and another, understanding where it's applicable to solve meaningful problems and where you end up going down rabbit holes. The entire purpose is to really just get your toes wet and understand what the capabilities are. Step two for us is to identify areas that have proper guardrails so that we can really measure the use of this technology in a way that we can feel comfortable trusting it. So, for example, some of our businesses have very standard patterns for creating their ETLs, and we use these technologies to accelerate that entire journey. Data issues, finding gaps, communicating with vendors, extracting information from PDFs: for all these areas that traditionally were done by humans, leveraging this technology has proven to be very effective, especially at scale. It's a huge time saver for us. And the goal is that by covering these two areas, you now have a more sophisticated data engineer who understands what the tool sets and capabilities are, and you have also freed up enough space, by taking on the toil, the real operational aspects of what we do, to now consider the next frontier in the areas that you want to solve for.
The opportunities are obviously there; there are far too many to even list here. But from my perspective, the shift in how we operate as engineers is a significant one, and I wanna make sure that we do this carefully. Prompting is actually not easy, and you have to spend quite a bit of time making sure that you're giving the right context, making sure you're feeding your model the right data, and making sure that the work is reproducible, because the shift and evolution of this technology is so rapid that I could easily see this becoming a major source of toil in your production environment. So we really have to change how we develop, assume that things will operate and change very quickly, and fold them in as we move along.
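As a loose illustration of the PDF-and-vendor-correspondence extraction work Effie mentions above, here is a sketch of prompting a language model to turn unstructured vendor text into a structured record. The `call_model` function is a purely hypothetical stand-in for whatever model API a team actually uses, and the schema is invented; nothing here reflects a real vendor integration.

```python
# Hypothetical sketch: turn an unstructured vendor notice into a structured record.
# `call_model` is a placeholder for a real LLM client call; it returns a canned
# response here so the example is self-contained.
import json


def call_model(prompt: str) -> str:
    """Placeholder for an LLM API call (assumption, not a real client)."""
    return '{"dataset": "eod_prices", "gap_start": "2024-03-01", "gap_end": "2024-03-03"}'


def extract_gap_report(vendor_notice: str) -> dict:
    prompt = (
        "Extract the affected dataset and the gap date range from this vendor "
        "notice. Respond with JSON keys: dataset, gap_start, gap_end.\n\n" + vendor_notice
    )
    return json.loads(call_model(prompt))


if __name__ == "__main__":
    notice = "We identified missing end-of-day prices between 2024-03-01 and 2024-03-03."
    print(extract_gap_report(notice))
```

Her point about guardrails applies directly: output like this still gets validated (schema checks, date sanity, human review for anything unusual) before it drives any downstream action.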
[00:33:28] Tobias Macey:
Another interesting aspect of working at that foundational layer of the organization, in terms of the technical stack, is that, as we've discussed, everybody is going to have their own ideas about what is the best approach for a particular thing and what is the area that the business should be investing in, because, obviously, their idea is the most important or most impactful. And I'm wondering what you see as some of the aspects of the socio-technical friction that you have found to be either most frequent or most challenging to address.
[00:34:03] Effie Baram:
I've found that even if I were able to meet my customers' requirements and pace, if I had infinite resources at my disposal, I don't think the end result would meet their expectations in the long run. And the reason I say that, again, is the notion of having a platform. There is something to be said for a platform that provides data with, think of it as a data contract, where it's very well defined what it is my customers are receiving, where I have guarantees on the quality of the data, and where I even give you capabilities to research much faster by providing the data in a certain shape along the way. It also reduces the time my customer will need in order to hook their software into the system that we just created, had we gone down this path. And so one of the things that I have done is to anchor on one or two things and do them really well. Think of it as ice cubes to basically counter the snowflake effect, where everyone wants to have something very different and unique to them. I find that, by and large, if you provide good ice cubes, good patterns, good APIs and contracts for your data, even if it does not meet my customers' requirements at 100% but only at 90, they will opt to come back into the platform and use it, because it's available today, rather than having forked. Because if you fork, you get off the ground very rapidly, but now you have to build the support function and the life cycle management.
You basically have to take an entire platform journey on this leaf node. And, obviously, oftentimes that's not front and center when your customer is thinking about it, so it's not something that I can negotiate upfront. Oftentimes, you really have very demanding business needs that you need to meet. But by having a platform that gives everyone what they need, by and large, that usually covers the most ground. So this is where anchoring on technologies like BigQuery comes in. The reason I picked it was that I knew it would be able to meet most of my customers' needs relatively rapidly. It also still solves my needs, because I can stay within this platform. So right now we're looking at the Google console offerings, including Gemini assist, to see if maybe our data analysts don't have to leave the BigQuery ecosystem and can stay within this context. And, again, all this to basically accelerate. If I'm able to provide 80% of my customers' needs, I think I'm able to reduce some of that friction.
[00:37:09] Tobias Macey:
Another interesting challenge in terms of operating a platform is that the boundaries can become a little bit fuzzy, because people want you to take on more responsibility, or maybe you want to be able to exert more control over particular patterns, versus the other trend, which is that you want to cede control because you don't want to be responsible for as many pieces. And I'm wondering how you think about the definition and evolution of those boundary layers as you gain greater operational capacity or greater comfort with the different workflows, and as more workflows become standardized.
[00:37:50] Effie Baram:
Yeah. This definitely comes from reliability engineering; that's what we always did. You had customers on the right who would sell you their amazing solution that they're gonna pass on to the reliability team to safekeep, and then we had to basically walk the journey of bringing it up to the standards. So in some ways, it's no different than software. I definitely find that you have customers that will basically create a proof of concept and expect the platform to just naturally absorb it, without any particular cost. I tend to pick teams that are willing to work with us to bridge some of that gap. Again, I think having a team that basically worked offline and now expects us to absorb their component into the system is not always a bad pattern. It's a pattern that can be used for driving innovation.
And so when I make a decision to reabsorb and basically assume another team's innovation, I tend to do that because it also meets some of our engineering requirements. There will be times where it will basically stay outside if it doesn't meet some of the basic requirements. So, for instance, if the data that you're producing doesn't meet the quality requirements, there will be cascading effects on the operations team and on other customers that haven't been considered upfront, and we won't be able to take that. So I basically use judgment when I make those decisions or trade-offs.
I'm big on partnering, in that I really love to find opportunities with other teams, because, usually, when you have two teams that need each other, where I could use their technology and they could use my services, you find that there is a much better outcome at the end of it versus really holding the line on, you know, this is a platform, and unless you meet the platform's requirements, you're out. I just find that that also risks teams forking off sideways and basically risks your platform becoming null and void.
[00:40:09] Tobias Macey:
And as you have been responsible for the care and maintenance of such a core piece of the technology stack for the organization, and worked with the various consumers of that platform to address their use cases, what are some of the most interesting or innovative or unexpected ways that you've seen your specific technology stack applied, or some interesting ideas that you've seen around the pattern of foundational data?
[00:40:39] Effie Baram:
I think one of the shifts that has started to emerge, given the turnkey technologies available, the cloud-first technologies, and the agentic capabilities, is that the focus is really moving towards data engineers having significantly more domain context around the data. If ten, fifteen years ago you were expected, as part of your software engineering role, to build the infrastructure, or at least integrate the infrastructure, in order to create the platform, that now is, for the most part, happening in support of your business needs. And so our data analysts and our engineers are expected to have a real, deeper understanding of the intent of the data that they are working with. In some ways, we are elevating the skills of our engineers to work much closer to our researchers.
So even in research, not everything is just pure innovation. Sometimes you have to do a lot of forecasting or feature extraction. And so these are areas that we can more comfortably step into, augmenting the datasets, enriching them with different datasets from other sectors or markets, if you will. These are areas that traditionally the researcher would handle; they would basically own the entire research ecosystem. Now we can sit much closer alongside them. And this is definitely a shift that some software engineers will thrive on and be excited about, and others will find very challenging and not appealing, depending on the kind of software engineer you are. Are you a builder? Are you a problem solver in the domain? That will really shape your career aspirations in one form or another. So that, I think, is one of the biggest shifts that I see.
[00:42:52] Tobias Macey:
In your experience of building this foundational data system and working with the organization to take advantage of those capabilities and manage the team and the requirements around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:10] Effie Baram:
Probably the hardest and most surprising challenge that I found was looking at financial data, in that, in some ways, it has a very well-defined contract. Everyone in our industry consumes this data. You know, we all have our wholesalers that we buy from. It follows a very particular shape. But, really, under the hood, once it lands with us, I discovered that how we use the data, how different businesses use the data, requires significantly more domain expertise, and it's a lot more nuanced than I thought going in. One would think that, say, for instance, I buy historical data about IBM, it follows a very particular shape, it lands, and chances are everyone in my industry does the same. Well, maybe not. Maybe I augment the data with additional data having to do with hardware purchases or, you know, new CPU innovations in technology. So all of a sudden, the ability to really understand how to integrate different datasets into what is really very coarse, wholesale data becomes a lot more nuanced, and something that not all of the software engineers had an easy time effectively navigating.
[00:44:43] Tobias Macey:
And as you talk to people in other organizations, as you talk to people who are in peer relationships to you, what are the cases where you would say that the foundational data team approach is the wrong way to address the data needs of a given organization?
[00:45:02] Effie Baram:
What we do, unlike other data engineering organizations, is really spend a considerable amount of time modeling the data. This is where you augment it with other data sources, where you understand both the intent of the data as the world views it and how it will get integrated all the way through our systems: how it's ingested, how, say, the ops team clears it, how legal and compliance look at it, how it's treated, how it's researched. So you have so many different customers that have to understand and extract the meaning from the data. In that way, I find that it's a very specific and rich type of space to be in, because the context of what you're delivering matters a great deal. Where it does not fit the same pattern is when we have a very specific ask from a researcher for a particular dataset.
So it's a one-to-one mapping. It's not a wholesale function. It does not necessarily have very complex transformations, and it may not need significant history that's extremely rich and dense. In those cases, foundational data is not necessarily the right place to solve for these problems. I think of those types of data requests as more shallow, in that there are many of them, but they're relatively simple: simple transformations, simple downloads, and very few customers, if more than one at all. It might be just one customer at the end. Foundational is very few datasets, extremely sophisticated in what they are actually modeling, with many customers along the workflow that I described earlier.
[00:46:55] Tobias Macey:
And as you continue to invest in and iterate on your data platform and stay abreast of the technological evolution of the ecosystem, what are some of the resources that you find particularly helpful as you plan for successive iterations of your technology stack and your platform architecture, and some of the ways that you're thinking about the role of the foundational data layer as AI starts to subsume more of the technology ecosystem?
[00:47:30] Effie Baram:
I find it very challenging, actually, right now to keep up. And, again, it's because the pace of innovation is unlike what I've seen in the past. It's very exciting. So I spend significant time online reading up, and also experimenting a lot myself to sample this new model, this new LLM that just came out. What are the features? Does it actually meet some of the needs that we have? For some technologies that we're sampling, we are trying to carve out as much time as we can for experimentation. But the goal that we're setting, so that it's not completely open ended, is that, at least aspirationally, it has to basically pay for itself at the very least, so that we're not completely spending our time in R&D and not actually producing.
So, a significant amount of time online, outside, in ways that I haven't done as much in the past, because I truly feel that if I were not looking at what's going on in the industry for the next six months, the world is gonna look quite different six months from now. So that is one area. I also spend a good amount of time with my colleagues. We brainstorm with colleagues, former and current. We created a lot of working groups where we're sharing ideas, and we're effectively federating that research both inside and out, again in ways that I haven't seen before. And I find it extremely helpful, because there will be others who are thinking about the same problems that we're having and solving them in a more innovative way; perhaps they've already solved it. So, for example, there is a surge right now in MCP servers that we stood up, but rather than sending a hundred of them out there through all of our working groups, we created an actual catalog and enumerated what those are and how to use them. Basically, we are almost democratizing that work and helping each other out to get ahead. And, again, it's something that I haven't seen as much, especially in a business context. You're usually sitting in front of a problem and trying to stay focused on that. This one is a bit of a game changer, where we're all contributing and consuming at the same time, and it all helps us to actually accelerate innovation.
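She doesn't describe the catalog's actual format, but as a minimal sketch of the idea, an internal registry of MCP servers can be as simple as a list of described entries with a keyword search over them; the server names, fields, and endpoints below are invented.

```python
# Invented example of a tiny internal catalog of MCP servers: each entry says
# what the server does, who owns it, and how to reach it.
from dataclasses import dataclass


@dataclass(frozen=True)
class McpServerEntry:
    name: str
    description: str
    owner: str
    endpoint: str


CATALOG = [
    McpServerEntry(
        name="dataset-metadata",
        description="Answers questions about dataset schemas and owners.",
        owner="foundational-data",
        endpoint="mcp://internal/dataset-metadata",  # hypothetical address
    ),
    McpServerEntry(
        name="vendor-status",
        description="Reports known vendor outages and data gaps.",
        owner="data-ops",
        endpoint="mcp://internal/vendor-status",  # hypothetical address
    ),
]


def find_servers(keyword: str) -> list[McpServerEntry]:
    """Simple keyword search over names and descriptions."""
    kw = keyword.lower()
    return [e for e in CATALOG if kw in e.name.lower() or kw in e.description.lower()]


if __name__ == "__main__":
    for entry in find_servers("vendor"):
        print(entry.name, "->", entry.endpoint)
```

The value is less in the data structure than in the shared convention: teams register what they built, and everyone else can discover and reuse it instead of standing up a duplicate.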
[00:50:01] Tobias Macey:
Are there any other aspects of the work that you're doing or the ways that you're thinking about the role and the applications of foundational data systems that we didn't discuss yet that you'd like to cover before we close out the show?
[00:50:13] Effie Baram:
I think about the evolution of data from what it was, say, twenty years ago, where it was more of a utility and the outcome of all the software that one would write. Fast forward to today, and data is the product, by and large. It's front and center. It basically has its own pillar in most engineering organizations. You see other areas in engineering starting to shift aside or take a different shape, whereas data is becoming really the core of what most businesses rely on. And I think these are absolutely exciting times to be in the data space. I've always seen data that way, but now we also have the technologies to truly treat it as code. And with the proliferation of the agentic technologies, you definitely have the opportunity to spend a lot more time deriving information, not just producing data. That is something that gets me very excited, because, again, ten, fifteen years ago, in order to play in the data space, you had to really carve out a significant amount of the life cycle. That is now shrinking, giving one the opportunity to truly treat data as a living, breathing, evolving, shifting entity that fuels a lot of ideas. And I think the opportunities are limitless.
Very exciting.
[00:51:52] Tobias Macey:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:52:17] Effie Baram:
The biggest gap that I still struggle with is good, pragmatically usable lineage solutions. And the reason I'm calling out lineage as a big gap is that with the evolution of data and the capabilities that it offers, you can no longer expect that the schema will be the same ten years from now, a year from now, a month from now. The transformations are becoming a lot more sophisticated. The producers, the consumers, that sort of contribution pattern is increasing dramatically. And so managing a complex, meaningful dataset without having full introspection into the various checkpoints along the way that led to the production of that dataset is becoming effectively like a no-op. So, good lineage systems.
The reason I find that still a gap is that it's almost like back in the day in the operating system world, you had technologies like DTrace, where you needed a PhD in order to fully understand why your server behaved a certain way. The premise was phenomenal, but the implementation really required significant depth. I find that in some ways it's not quite as complicated on the lineage side, but we need to be able to hook into both existing and, obviously, living, breathing, already-built datasets, so that you are able to really shape the data into future use cases that you can't consider today. You know?
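To make the idea of checkpointed lineage concrete, here is a toy model of the kind of record such a system might keep per materialization; it is a sketch of the concept only, not any particular lineage standard or Two Sigma's tooling, and all names in it are invented.

```python
# Toy lineage model: every materialization records its inputs, the code
# version that produced it, and when and by whom it ran, so questions like
# "what changed, when, and by whom" have an answer.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    output_dataset: str
    input_datasets: list[str]
    transformation: str          # e.g. a model name or script path
    code_version: str            # e.g. a git commit SHA
    run_by: str
    run_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


LINEAGE_LOG: list[LineageEvent] = []


def record(event: LineageEvent) -> None:
    LINEAGE_LOG.append(event)


def upstream_of(dataset: str) -> set[str]:
    """Walk the log to find every dataset that feeds into `dataset`."""
    found: set[str] = set()
    frontier = [dataset]
    while frontier:
        current = frontier.pop()
        for ev in LINEAGE_LOG:
            if ev.output_dataset == current:
                new = set(ev.input_datasets) - found
                found |= new
                frontier.extend(new)
    return found


if __name__ == "__main__":
    record(LineageEvent("clean_prices", ["raw_prices"], "models/clean_prices.sql", "abc123", "effie"))
    record(LineageEvent("daily_returns", ["clean_prices"], "models/daily_returns.sql", "abc124", "effie"))
    print(upstream_of("daily_returns"))  # {'clean_prices', 'raw_prices'}
```

The hard part she is pointing at is not the record itself but attaching this kind of introspection retroactively to datasets that already exist and keep evolving.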
[00:53:58] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Two Sigma and your overall approach to building that foundational data team and the platform approach to data systems. I appreciate the time and effort that you're putting into that, and I hope you enjoy the rest of your day.
[00:54:16] Effie Baram:
Thank you so much, Tobias.
[00:54:25] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Effie Baram and Data Engineering
Data's Role at Two Sigma
Balancing Data Quality and Latency
Foundational Data Engineering Challenges
Platform Mindset and User Empowerment
Regulatory Constraints and Data Management
Architectural Patterns and Data Lifecycle
AI's Role in Data Engineering
Socio-Technical Friction in Data Platforms
Innovative Uses of Foundational Data
When Foundational Data Teams Aren't the Answer
Planning for Future Technological Iterations