Summary
The life sciences industry has seen incredible growth in scale and sophistication, along with advances in the data technology that make it possible to analyze massive amounts of genomic information. In this episode Guy Yachdav, director of software engineering at Immunai, shares the complexities that are inherent to managing data workflows for bioinformatics. He also explains how he has architected the systems that ingest, process, and distribute the data that he is responsible for, and the requirements that are introduced when collaborating with researchers, domain experts, and machine learning developers.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- Your host is Tobias Macey and today I’m interviewing Guy Yachdav, Director of Software Engineering at Immunai, about his work at Immunai to wrangle biological data for advancing research into the human immune system.
Interview
- Introduction (see Guy’s bio below)
- How did you get involved in the area of data management?
- Can you describe what Immunai is and the story behind it?
- What are some of the categories of information that you are working with?
- What kinds of insights are you trying to power/questions that you are trying to answer with that data?
- Who are the stakeholders that you are working with and how does that influence your approach to the integration/transformation/presentation of the data?
- What are some of the challenges unique to the biological data domain that you have had to address?
- What are some of the limitations in the off-the-shelf tools when applied to biological data?
- How have you approached the selection of tools/techniques/technologies to make your work maintainable for your engineers and accessible for your end users?
- Can you describe the platform architecture that you are using to support your stakeholders?
- What are some of the constraints or requirements (e.g. regulatory, security, etc.) that you need to account for in the design?
- What are some of the ways that you make your data accessible to AI/ML engineers?
- What are the most interesting, innovative, or unexpected ways that you have seen Immunai used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working at Immunai?
- What do you have planned for the future of the Immunai data platform?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
RudderStack provides all your customer data pipelines in one platform. Easily sync data in and out of your warehouse, data lake, product tools, and marketing systems. State-of-the-art Event Stream, ETL, and Reverse ETL enable powerful use cases, and flexible APIs allow you to integrate with your existing workflow. Sign up free, or just get the free t-shirt for being a listener of the Data Engineering Podcast, at dataengineeringpodcast.com/rudder. Your host is Tobias Macey. And today, I'm interviewing Guy Yachdav, director of software engineering at Immunai, about his work at Immunai to wrangle biological data for advancing research into the human immune system. So Guy, can you start by introducing yourself?
[00:01:40] Unknown:
Sure. And thanks for having me, Tobias. I'm Guy, director of software engineering here at Immunai. I've been doing data science and data engineering for about 20 years now. And do you remember how you first got involved in the area of data? I got into data and genomics around the year 2000. Those were really the first days of the Human Genome Project. The data coming out of the project had just been released, and everyone in the community was very excited about being able to access this data and analyze it. I heard from a friend about this new field of bioinformatics, which is basically the field of science in which you analyze genomics data and build software to be able to both analyze it and make sense of it. I got drawn into it, so I joined the Columbia Genome Center.
And basically back then we started building everything from the ground up. These were still the days when there weren't a lot of off-the-shelf tools. We pretty much had to build everything, from the server racks all the way up to adapting the operating system to be able to wrangle the data that we were working with. We built our own scheduler. We built our own orchestration system. We built our own caching system, our own data lake, and, in a way, our own warehousing solution. Everything was kind of self-made, and we figured things out as we went.
[00:03:06] Unknown:
And so in terms of the Immunai platform, can you give a bit of an overview of what it is that you're building there, some of the story behind that, and some of the specific challenges that are posed by biological data when you want to use it in this sort of big data machine learning context?
[00:03:25] Unknown:
Let me give a bit of an overview of what Immunai is actually doing and the idea behind it, and maybe, in the broader scope, talk a little bit about the biopharma industry and where we are right now. I think it's not a big secret that the era of blockbuster drugs is kind of behind us, and Immunai is built around the idea that the next major breakthrough in drug discovery is going to come from data mining, the ability to take big data and find relevant and interesting biological insight. Specifically, the idea of looking into what is called single cell data is relatively new, so I should probably explain what single cell data is. If you were to take a microscope to the immune system and really get down to the gene level, you would be able to actually profile the current state of a cell, whether the cell is in what is called its normal form or in an abnormal state.
That would give you an indication of whether or not the cell is responding to a current treatment, whether or not the cell is acting abnormally given a certain indication, a certain disease. And if you look into this map of cells, you start to get a better idea of what is the reason for a certain disease, or what is the reason that a certain patient is responding or not responding to the treatment they're receiving. So Immunai is really built around the idea that using single cell genomics gives us a much better lens into the biology and into the mechanisms that are behind diseases and behind the effectiveness of the treatments that we use.
And this whole notion is basically built around big data. The idea behind Immunai is that we are not only analyzing data, we ourselves are actually generating the data. We are completely vertically integrated, meaning that we have our own facilities to generate the data through genomic sequencing. We have our own capabilities to fine tune the way that we generate this data so we can very precisely and accurately
[00:05:40] Unknown:
analyze this data. In terms of the types of information and the formats that you're working with, I'm curious if you can give some detail there, where a lot of people might be familiar with working with tabular structures or text arrays or things of that nature. And I know that with biological formats, there are usually going to be incredibly long sequences of text for things like genomic sequences, or there are custom data formats to be able to encode some of the information about things like DNA or RNA sequences or some of these other specialized data types. I'm curious if you can just talk to some of the types of information that Immunai is working with and some of the specialized capabilities that you've had to build out to account for them. Yeah. So that's exactly right. I mean, we start at the very rawest form. We're looking at DNA sequencing, which is basically
[00:06:31] Unknown:
just very long strings made out of A, C, G, and T. And that's kind of the data that we get out of the measuring instrumentation at the sequencing lab. And unlike other analytics operations, we have a lot of work to do in order to get to the point where we can actually start making sense of this data. There are a lot of constraints that the experimental side, the lab side, is required to work within, and that translates into the way that the data is set up. For instance, one of the characteristics of single cell data is that we are processing samples, we're processing biospecimens. And in order to save costs, what we do is mix those biospecimens together and run them through the sequencing machine.
When this comes out as data from the sequencing machines, we actually have to go and separate those DNAs back into the different biospecimens. That obviously requires a whole battery of algorithms that do the job and actually sort the biospecimens. There's a lot of noise, obviously, that is added in this process, and we have to denoise it. And only after we separate the biospecimens, denoise the data, clean it out, and sort out the data that is up to the analytic quality that we need, only then can we actually start looking at and making sense of this data. In terms of scale, we're talking about close to a terabyte of data from the studies that we look at. It's quite a lot of data that we go through. And in terms of variety, while the raw data itself is pretty homogeneous, as I said, we're only talking about DNA and RNA sequencing,
when we're actually looking at trying to make sense of this data, that's the point where heterogeneous data comes in. We're talking about public and private databases that we're looking at. We're talking about imaging data that is involved. And all of that needs to be integrated together, which obviously also brings a host of issues that we can discuss a bit later. But all in all, in terms of formatting, we start off at a very broad level, we make sense of this data, and we end up with what is very much accepted and practiced in our field, the Seurat object. This is basically a gigantic dictionary that happens to be encoded using the R language and is accessible through a specialized R library.
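A minimal, purely illustrative sketch of the demultiplexing idea described above: pooled reads are separated back into per-biospecimen buckets by a sample barcode. The barcode map and read layout here are hypothetical; real pipelines use dedicated, error-tolerant tools rather than exact prefix matching.

```python
# Toy demultiplexing: group pooled reads by the biospecimen their leading
# barcode points to. Barcodes, sample names, and read format are made up.
from collections import defaultdict

BARCODE_TO_SAMPLE = {          # hypothetical barcode -> biospecimen mapping
    "ACGT": "specimen_01",
    "TTAG": "specimen_02",
    "GGCA": "specimen_03",
}
BARCODE_LEN = 4

def demultiplex(reads):
    """Group raw reads by biospecimen; unknown barcodes go to 'undetermined'."""
    buckets = defaultdict(list)
    for read in reads:
        barcode, sequence = read[:BARCODE_LEN], read[BARCODE_LEN:]
        sample = BARCODE_TO_SAMPLE.get(barcode, "undetermined")
        buckets[sample].append(sequence)
    return buckets

if __name__ == "__main__":
    pooled = ["ACGTTTGCA", "TTAGGGACT", "GGCAACGTA", "NNNNACGTA"]
    for sample, seqs in demultiplex(pooled).items():
        print(sample, len(seqs))
```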
[00:09:09] Unknown:
As to the types of questions that you're trying to answer from this data, I'm curious if you can give some examples of the ways that you need to be able to manipulate this raw information to be able to answer these questions and some of the derivations that you're looking to be able to provide to some of the downstream consumers and maybe talk a bit about who the consumers are of this processed
[00:09:37] Unknown:
data. I'm going to talk about three main users that we have. Two of them are more on the biology side of things, and then we have the machine learning engineers that are looking into the data. We'll start with the immunologists. They're very much concerned with high-level questions about the biology. For instance, they want to design an experiment that would help them figure out which drugs they can match to a certain disease, like which drug would be the most effective on a specific patient with a specific genome profile. And what immunologists are interested in is looking into large datasets and finding the short list of those drugs that could destroy a certain disease. And for that, I think that a fairly simple data repository does the trick there. Where it becomes a little bit more complicated is when you're actually talking to a different class of user. These are the computational biologists, on the algorithmic and big data side.
And computational biologists are very much interested in performing deeper analysis, and they can ask questions such as: in which disease can I find a gene that is overexpressed, where we find its RNA sequences at a larger amount than normal? And in order to be able to perform such an analysis, obviously, they need to draw large amounts of data and start comparing between different datasets. And for that, obviously, we need our data warehousing and downstream analysis to be in place. And for the third class, our machine learning engineers, the use case there is a little bit different.
They're less interested in looking at specific studies; they're looking at cross-study, cross-project, cross-dataset analysis, where they actually use our data mart to filter and generate specific datasets so they can answer questions such as: how do I assign a specific cell type label to a cell that I'm seeing in a sample? Or they would look at our data and try to predict what would happen if I knocked out a gene, if I changed a specific gene in a specific location. What would happen to the cell? Would it impact it in a way that can actually trigger an immune response or not? And for that, as I mentioned, they require large data. That's part of the reason why Immunai was established and decided to cover the entire gamut from sample and sequence generation all the way up to analysis.
So we could actually control the process of generating those datasets.
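As a rough illustration of the cell type labeling task the ML engineers tackle, here is a minimal scikit-learn sketch on synthetic expression vectors. The features, labels, and model are placeholders for illustration, not Immunai's actual approach.

```python
# Sketch: learn to assign a cell type label from gene expression.
# Data is synthetic; "T cell"/"B cell" labels and the signal genes are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_cells, n_genes = 600, 50

# Fake expression matrix: two cell types with shifted expression in a few genes.
X = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)
y = rng.integers(0, 2, size=n_cells)      # 0 = "T cell", 1 = "B cell" (illustrative)
X[y == 1, :5] += 3.0                      # plant a type-specific signal

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```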
[00:12:30] Unknown:
As far as being able to manage the engineering around working with all of these datasets, processing them, and providing an appropriate representation to these different stakeholders, I imagine that requires a certain amount of domain expertise among the engineering group to be able to know, you know, this transformation is valid given this context, but if I happen to manipulate this source format in such a way that these sequences of letters get transposed, then it's going to be complete garbage. And so I'm curious how you've approached that aspect of bringing in the appropriate domain expertise without necessarily having to have everybody on your team be an expert in immunology or single cell genomics?
[00:13:17] Unknown:
That's an excellent question. As I mentioned earlier, there are two major types of biologists within the team. We've got the immunologists, who are domain experts on the immune system. And we've got the computational biologists, who are domain experts that basically bridge the gap between the biology side and working with algorithms and big data. And then we have the software engineers, who basically build the whole data infrastructure and put it in place. And these three classes of experts obviously need to communicate and work together to build up the tools that we use at Immunai.
And in order to do that, you have to maintain a certain degree of integration between the different disciplines while keeping a good enough separation that allows each of those disciplines to operate independently. The way that we did this, and I think I have a good example to demonstrate it, is within our data processing pipeline, which combines sequencing data that needs to be parsed and eventually analyzed by immunologists, and also needs to be processed by algorithms that the computational biologists are developing.
It also needs to be assessed for quality by the computational biologists who are looking into this data, and then the infrastructure needs to be built by software engineers, as I mentioned earlier. To that effect, we first created a team that is very heterogeneous, a team composed of all three classes of experts. But on the technical level, we're using tools that allow us a certain degree of separation between, for instance, the orchestration layer and the algorithmic and analysis layers. What is interesting, by the way, is that while the infrastructure itself is built in Python, all of the algorithmic and analysis layers at Immunai are built using the R language, which is specifically suited for analyzing single cell data. This in itself obviously poses a challenge for building big data analysis systems.
The toll of actually standing up a large-scale, robust data system is pretty obvious there. But we've adapted to this situation and built really nice solutions that keep track of, for instance, versions and compatibility between the two different languages. We've built nice bridges between the Python and R sides. We've actually set up the software development cycle such that computational biologists can go in and develop their own specific R algorithm without being concerned with everything that has to do with the orchestration, the monitoring, the logging, and everything else that has to do with running the analytical piece.
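The Python-to-R bridge described here is custom to Immunai, but the general handoff pattern can be sketched with the open source rpy2 package. Treat this as an illustration under that assumption; the R function below is made up.

```python
# Sketch of a Python -> R handoff using rpy2 (illustrative, not Immunai's broker).
import rpy2.robjects as ro

# Define a toy "analysis" function on the R side.
ro.r("""
normalize_counts <- function(x) {
  # log-normalize a numeric vector of counts
  log1p(x / sum(x) * 1e4)
}
""")

counts = ro.FloatVector([0, 3, 10, 1, 250])   # toy per-gene counts
result = ro.r["normalize_counts"](counts)     # call into R from Python
print(list(result))
```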
[00:16:02] Unknown:
For that communication at the language level, are you able to take advantage of the Arrow format for being able to hand off the in-memory data structures? Or are you doing it where you're maybe serializing over the wire to send information to the R runtime and then shuttling it back to the Python
[00:16:26] Unknown:
layer? Yeah. That's exactly right. We have basically a broker library that's shuttling data back and forth between Python and R. We're also using a mechanism that allows us to propagate any feedback coming from R back to Python, and we can actually show it on our orchestrator.
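One concrete way to hand a data frame across the Python/R boundary, in the spirit of the Arrow discussion above, is Arrow's Feather format: written with pyarrow on the Python side and readable with the `arrow` package on the R side. The columns and file path below are illustrative only.

```python
# Write a data frame in Arrow's Feather format for an R process to pick up.
import pandas as pd
from pyarrow import feather

cells = pd.DataFrame({
    "cell_id": ["c1", "c2", "c3"],
    "cell_type": ["T cell", "B cell", "T cell"],
    "n_genes_detected": [1200, 950, 1430],
})

feather.write_feather(cells, "cells.feather")   # Python side
# R side (for reference):  df <- arrow::read_feather("cells.feather")
```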
[00:16:44] Unknown:
As far as the challenges that are unique to biological data, what are some interesting examples of complexities that you've had to deal with because of a lack of generalized support in the tooling? Where if you're working with clickstream data, you can easily just put it into Kafka and process it with Spark or something, but if you're working with this genomics data, you might not have as much in the open source ecosystem to pull from. I'm curious how that has manifested as far as where you've invested your engineering time and how that boils down in the build versus buy decision.
[00:17:27] Unknown:
I'll preface and say that, because of my previous life at the Columbia Genome Center, working with big data and inventing, or reinventing, the wheel back then, we've seen a lot of the tools, or types of tools, that we use today for streaming and orchestration actually being built at genomics labs, just because there wasn't anything on the shelf back then. So a lot of those solutions, or at least some of them, can trace their origins to the genomics labs. Unfortunately, the reverse did not happen, so we didn't have the general-purpose tools becoming good solutions for biology.
We do know that many vendors have tried to use genomics to present use cases in which they're showing the capability and dexterity of, for instance, their cloud services or their warehousing capabilities. But we haven't seen something that is really tailor-made for genomics and actually able to encompass the needs of genomics. And therefore, there's a lot of effort on our side to cover this gap. I can give one example, for instance, of a core algorithm in our processing pipeline where we're feeding in a set of raw sequencing data, and based on this DNA sequence we're trying to get the label of the gene; we're trying to understand which gene we're looking at. The algorithm itself is based on clustering, in which we're looking at this long genomic sequence and, using clustering algorithms, we identify the closest match, and from there we get the gene name. However, it does a whole lot more than just labeling. As I said, there's a lot of data cleaning and a lot of gymnastics being done with this data. And this entire algorithm is being developed and maintained by a single vendor.
And that specific vendor has a sequencing operation as well. Obviously, the software does play a big role in their offering, but it's not their major or core business. So while they are putting a lot of attention into developing this tool, it still lags and is not as advanced and capable as similar solutions you would find in other verticals. To add on to this, there are not too many competitors in the market for this specific capability, which doesn't leave a whole lot of room for deciding which algorithm you're going to use. You're kind of locked into this specific algorithm because you're using the sequencing format that this specific vendor is using, and because there aren't a whole lot of other options in the market. So the mitigation there is that we either build extensions around it,
if we do get access to the source, or we build some kind of wrappers around it to add to the capabilities, and we try to get the most out of it. One of the interesting things is that it keeps the team much more challenged, and they come up with more creative solutions than they would with the usual tools. So the team is less spoiled.
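The wrapper approach mentioned above can be sketched roughly as follows. The vendor tool, its flags, and its output file are hypothetical placeholders, since the actual vendor interface is not named in the episode.

```python
# General "wrapper" pattern: run the vendor's black-box tool, fail loudly if it
# errors, then layer our own checks/extensions on its output. The command name,
# flags, and output artifact below are invented for illustration.
import subprocess
from pathlib import Path

def run_vendor_counter(fastq_dir: Path, out_dir: Path) -> Path:
    cmd = ["vendor-count", "--input", str(fastq_dir), "--output", str(out_dir)]
    completed = subprocess.run(cmd, capture_output=True, text=True)
    if completed.returncode != 0:
        raise RuntimeError(f"vendor tool failed:\n{completed.stderr}")

    matrix = out_dir / "gene_counts.tsv"        # hypothetical output artifact
    if not matrix.exists():
        raise FileNotFoundError(f"expected output missing: {matrix}")
    return matrix
```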
[00:20:32] Unknown:
Today's episode is sponsored by Prophecy.io, the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all data users can use software engineering best practices: Git, tests, and continuous deployment, with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage.
Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. And I know that the overall field of bioinformatics has been growing a lot in recent years and that a lot of that has happened in the open source ecosystem. I'm curious if there have been any kind of emergent standards that have grown out of that growth, some of the ways that your experience at Immunai maps to the ways that the broader community is starting to approach these problems, some of the ideas that you have benefited from, and some of the ways that you think the broader community might be able to improve, or ways that you might be able to refactor your overall approach to take advantage of some of these community-supported tools or platforms?
[00:22:12] Unknown:
You're absolutely right. There's a whole host of open source software and open source tooling on offer, and we've looked into those tools not only at Immunai but also back at the Columbia Genome Center. Some of those tools are actually the bedrock of the science itself. For instance, as I mentioned earlier, the analytics object that we're working on, which is named Seurat, is the main object that represents single cell data within our field. There's an open source library that allows you to generate, access, and modify it, and there's a big community built around it. Obviously, it comes with its weaknesses. Because it was developed in R, there's a whole host of issues around how we productionize doing analytics over this object.
I should add that at Immunai, one of our main occupations is actually industrializing single cell data. So we don't just look at one study; we have a well-functioning, well-oiled machine to generate and churn out single cell data at scales that are not being reached almost anywhere else. And to do that, obviously, we need a lot of automation, and we need to solve a lot of problems that researchers in the community are not really facing, because they're working with one Seurat object; they're working with one study. Whereas we work with many, which obviously brings in the question of scale and heightens the questions of variety of data and of managing this data. And in that sense, it's basically on us to start building the tools around it. In that sense also, those solutions are not available in the open source community.
[00:23:54] Unknown:
Another aspect of what you're doing is, as you mentioned, you have these multiple domain experts within your team, and then you're also interfacing with people who are doing active research, people who are trying to do analytical applications, people who are trying to build machine learning models. And I'm wondering if you can talk to some of the ways that you've architected your overall platform for being able to service all of these different use cases and some of the ways that you're able to converge in terms of the data outputs that can provide the utility to all these end users and some of the ways that you've had to specialize some of the output to be able to fit those different use cases?
[00:24:37] Unknown:
Let me say a couple of words about the architecture of the platform in general. There are, I guess, three or four main areas to the platform. As we mentioned earlier, we have the area of data processing. Then we have the area of data cataloging, which I'll touch on in a second, then we have the warehousing, and on top of all of this, we have the analytical applications. This conceptual separation already allows us to build the platform and make it such that it can serve the different use cases. However, many complexities emanate from having so many different disciplines and so many use cases that are trying to look at the data and investigate it from many different angles.
So I mentioned earlier that we have those two classes of biologists, the immunologists and the computational biologists, but I kind of oversimplified. Obviously, within the immunologist community, we have different types of immunologists asking fairly different questions. And within the computational biologists, we have different types of computational biologists looking to do different things, and those present different use cases. In order to be able to address those different use cases, we've converged on the minimal set of data that we need to process in a single way, assess its quality level, and bring it to a quality level
where we can continue and analyze it. And that kind of provides the core of our data repository. This is what we call the Annotated Multi-omic Immune Cell Atlas, or AMICA for short. That's our main data repository, and it is the sum of the data that has been aggregated at Immunai and assessed to reach a certain quality level so that it can be consumed for analytics and other use cases. The core of AMICA represents the data that we generate at Immunai. However, we can enrich this data from public sources and other databases to help the analytics process.
And this requires a whole layer of data integration that requires another domain expert: the bio-curators. These are experts who can help us figure out questions such as standardization around gene names. While there are some standards for naming genes in central databases, those standards obviously cannot be enforced, and therefore there can be many different aliases and synonyms for this unit of analysis called the gene. For clinical data, we can process samples that are defined by the clinical experts in one way and could be defined in a different way by other experts, depending on the disease area or the treatments that are being administered.
And for that, in order to actually meld everything together and bring it to the point where it can be analyzed, we need the help of the bio-curators, who build ontologies that help map this data to one standard that we maintain at Immunai. So by taking AMICA and the standardized data that the bio-curators bring in, we can expose the data and make it available to the different use cases. There are many different requirements for the way that we actually do computation. As I mentioned earlier, we can serve a very simple use case of doing a simple search over the data repository; that's basically just picking out and filtering some metadata terms, matching them up against the data warehouse, and doing some aggregate calculation on top of that.
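A toy version of the gene name standardization the bio-curators handle, mapping aliases onto one canonical symbol. The mapping table is a tiny illustrative sample, not Immunai's curated ontology.

```python
# Normalize gene aliases to a curated canonical symbol (illustrative table).
GENE_ALIASES = {
    "CD20": "MS4A1",
    "PD-1": "PDCD1",
    "PD1":  "PDCD1",
    "HER2": "ERBB2",
}

def canonical_gene(symbol: str) -> str:
    """Return the curated canonical symbol, falling back to the input."""
    key = symbol.strip().upper()
    return GENE_ALIASES.get(key, key)

assert canonical_gene("pd-1") == "PDCD1"
assert canonical_gene("MS4A1") == "MS4A1"   # already canonical
```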
There's the more complicated part where we actually start to look into large amounts of what we call gene expression, which is basically the amount of RNA sequences that we find in a specific cell. And if you want to start comparing between the gene expression of one study versus the gene expression of another study, or maybe we have one study with different conditions being administered within it and we want to compare between the two, then we are required to do heavy computation on top of our data warehouse, which is basically a two-step process.
We have to do some heavy lifting and filtering on the data warehouse and then feed that over to the downstream analytics computation, which also happens to run in R. That computation is done on the data frame that is exported from the data warehouse.
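As a sketch of the comparison step described here: once the warehouse has done the heavy filtering, a per-gene comparison between two studies or conditions can be as simple as a statistical test. The data below are synthetic, and the real downstream analytics in R are much richer.

```python
# Compare expression of one gene between two studies with a rank-based test.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
expr_study_a = rng.negative_binomial(5, 0.5, size=300)   # per-cell counts, gene X
expr_study_b = rng.negative_binomial(5, 0.4, size=300)   # slightly higher expression

stat, p_value = mannwhitneyu(expr_study_a, expr_study_b, alternative="two-sided")
print(f"Mann-Whitney U p-value for gene X: {p_value:.3g}")
```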
[00:29:23] Unknown:
As far as the requirements that you have in terms of timeliness or latency, I'm curious what you're working with there, the frequency of updates that you're getting from your source systems, some of the latency or timeliness constraints that you need to consider, and the end-to-end flow from collecting the source data through to delivering it to somebody who wants to process it in their machine learning model, build some analysis, or do some ad hoc research on it? We're a batch processing operation,
[00:30:00] Unknown:
and we're working with sequencing data that can take up to 10 days on the sequencing side. Where the data engineering effort comes into play, and where it actually adds a lot of value, is by creating this automation: automation that takes this raw data and churns out the best quality data that we can get out of the raw sequencing. In that sense, while the time intervals that we need to work with need to be within a couple of days' turnover, we're not really required to have data available instantaneously.
[00:30:33] Unknown:
Because of the sequencing process and the overall time frame of being able to generate new datasets, I'm curious how that factors into the work that you're doing on the data engineering side. Is there anything that you've had to do with your data providers to say, you know, this is the optimal output format that you can produce that makes it easier for me to consume it? Or some of the quality checks that you're able to execute early in the process, or whether you're able to get any sort of intermediate results from that sequencing to be able to understand, you know, this is a good run, we're going to let it go through to completion, or there are some errors happening in the data generation, so we're going to abort and start over so that we don't have to waste the entire 10 day cycle?
[00:31:22] Unknown:
Right. So QC, quality control, is an integral part of our processing operation, and we never throw out complete batches of data, but we can mark a specific chunk of the data at a specific quality level, where we may decide not to run analytics on it or it may not be good enough for a specific study. What is very interesting is that since we design and control our own experiments, and we can actually measure the quality of the data on the computational side, we can always inform the experimental side either on how we can improve for the next time, or, if a certain sample was not processed up to quality, we can always restart with a new configuration and improve the quality.
So there's a very interesting cycle of actually being able to control both sides, both the experimental and the computational side, and making sure that the experimental side is very much data-informed. It's not that we just run the experiment, are done with it, and let somebody else analyze the data. We have the full circle there, where we can actually run, process, and then inform whether or not we want to improve the quality.
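A minimal sketch of the quality-tiering idea described above: each sample gets a quality level rather than a whole batch being discarded. The metrics and thresholds are invented for illustration.

```python
# Assign each sample a QC tier that downstream analytics can filter on, and
# that can trigger re-sequencing. Metric names and cutoffs are made up.
def qc_tier(metrics: dict) -> str:
    if metrics["pct_reads_mapped"] >= 0.9 and metrics["median_genes_per_cell"] >= 800:
        return "analysis_ready"
    if metrics["pct_reads_mapped"] >= 0.7:
        return "usable_with_caution"
    return "flag_for_resequencing"

samples = {
    "specimen_01": {"pct_reads_mapped": 0.94, "median_genes_per_cell": 1100},
    "specimen_02": {"pct_reads_mapped": 0.62, "median_genes_per_cell": 300},
}
for name, metrics in samples.items():
    print(name, "->", qc_tier(metrics))
```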
[00:32:35] Unknown:
As far as the overall life cycle of data and some of the ways that the kind of data generation process happens. You mentioned that it's data informed, and so you have researchers who are analyzing the outputs of previous runs, and then they decide what is the next batch of data that I want to generate and analyze. And I'm just curious if you can talk to some of the overall kind of collaboration and communication aspects of what you've built to work with these researchers to help them understand, you know, this is an area that's worth exploring more or how they might talk to you to understand what are some of the ways that I should be thinking about structuring this experiment to be able to get the most useful data out of it and just that overall workflow and life cycle?
[00:33:24] Unknown:
We get a pretty good signal. For instance, when we do the quality control analysis, we get a pretty good signal as to where on the experimental side we could have done things differently to get better quality. But not just on the experimental side; we can also look at what the ML folks are getting from us. So for instance, one of the denoising efforts that we do at Immunai is what is called removing doublets. Sometimes in the sequencing process you can see, and count, the same cell twice. That is a very common phenomenon in single cell sequencing.
To compensate and correct for that, we're using a set of analytics, including machine learning algorithms, that identify those cells that are too similar and that we're counting twice. One of the nice things that we can do at Immunai is actually generate datasets for our machine learning experts in which we capture those doublets, those cells that we notice, and label them as doublets. And then the machine learning experts can train and improve the algorithms that identify those doublets. In turn, we're getting a much better process of actually cleaning out and denoising this data.
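A heavily simplified illustration of flagging candidate doublets so they can be labeled as training data. Real doublet detection relies on dedicated methods (tools such as Scrublet exist for this), and the similarity cutoff here is arbitrary.

```python
# Flag cells whose expression profiles are nearly identical as doublet
# candidates; those labels can then feed an ML training set. Data is synthetic.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(2)
expr = rng.poisson(1.0, size=(50, 200)).astype(float)   # 50 cells x 200 genes
expr[1] = expr[0] + rng.poisson(0.2, size=200)          # plant a near-duplicate pair

sim = cosine_similarity(expr)
np.fill_diagonal(sim, 0.0)
is_doublet_candidate = (sim > 0.95).any(axis=1)         # arbitrary cutoff
print("flagged cells:", np.where(is_doublet_candidate)[0])
```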
[00:34:44] Unknown:
In terms of the ways that the data that you're generating is being applied, some of the workflows that you've built at Immunai, or some of the specific technical complexities you've had to deal with, I'm just wondering what are some of the most interesting or innovative or unexpected applications of your platform that you've seen, or solutions that you have seen built internally to address this problem space?
[00:35:09] Unknown:
There are two that I think I can mention here that would be interesting. On the engineering side, being able to harmonize, to make sure that both the production code, which is written in Python, and the research code, which is written in R, coexist in our ecosystem, has required us to be very innovative not only in the way that we engineer our systems and their architecture, but also on the DevOps side. One solution that we came up with, and maybe I should first describe the problem, is that when you're building a large data engineering system that evolves over time, and an integral part of the system is based on R, you run into a lot of compatibility issues. R is a very nice language
if you have it contained to your own workstation and you are basically working on analysis tasks. But when you're trying to stand up a large system that continuously evolves, then R is a bit lacking, because R does not really manage versions in the way that we're used to managing versions. We built a pretty innovative solution around that using a date-snapshot system, in which we reach out to repositories that keep track of the different versions of different R packages as of a specific date. And we have to match that with the state of the operating system and the system libraries that support it.
Together, combining this configuration, the fact that we are able to pick out a specific R library from a specific date along with the supporting system libraries and system packages, we get a solution that is guaranteed to work at any given time and be compatible with other libraries. This has been an innovative solution that we came up with out of necessity. That's on the engineering and the DevOps level.
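A rough sketch, not Immunai's actual tooling, of driving dated-snapshot R package installs from Python-side build scripts. The snapshot URL format follows Posit Package Manager conventions and should be treated as an assumption.

```python
# Pin R package versions by installing from a CRAN repository frozen at a date.
# The default repo URL is an assumption; pass your own snapshot base if needed.
import subprocess

def install_r_packages(packages, snapshot_date,
                       repo_base="https://packagemanager.posit.co/cran"):
    repo = f"{repo_base}/{snapshot_date}"              # e.g. .../cran/2022-06-01
    pkgs = ", ".join(f'"{p}"' for p in packages)
    r_code = f'install.packages(c({pkgs}), repos = "{repo}")'
    subprocess.run(["Rscript", "-e", r_code], check=True)

# Example (requires R on the PATH):
# install_r_packages(["Seurat", "arrow"], "2022-06-01")
```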
On the ML level, something that we do that is very unique to Immunai is the ability to project from the large-scale data that we generate, from this larger distribution of cells, onto rarer examples. Obviously, the common cell types are easier to profile, but for those cell types that are quite rare, it's harder to actually profile them. We use advanced machine learning to be able to predict how those cells would behave under specific conditions.
[00:37:38] Unknown:
And in your own experience of building out this system and working in this problem space, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:37:47] Unknown:
The industrialization of single cell genomics has posed many challenges for us. I mentioned the fact that we had to make the research language, R, coexist with the production language, Python in that sense. We had to take different disciplines and have them work together in order to build those solutions, and we had to make sure that they're all communicating in the same language and that they're all able to see and understand the computational processes that we're running. In that sense, I did not mention it yet, but the orchestration tool that we are using is Dagster, developed by Elementl. One of the reasons that we decided to use Dagster as a core platform in our stack is that we really like its capabilities.
Dagster, through its very nice UI, is able to make the architecture, the process, and the workflow of data processing very explicit, not only for the software engineers but also for the computational biologists, the immunologists, and the ML people, so we can all huddle around that processing workflow together, make tweaks to it, and decide how we want to run a specific dataset or how we want to tweak the different parameters in the way that we generate the features for our next ML model. Making it explicit and putting this architecture in front of all of the constituents, in front of all of the experts, has made a significant improvement in the way that we communicate and collaborate at Immunai.
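A minimal Dagster sketch of the kind of explicit, shared workflow described here: each step is an op, the job wires them together, and the resulting graph is visible in Dagster's UI. The ops and their names are illustrative stand-ins, not Immunai's actual pipeline.

```python
# Minimal Dagster job: the processing steps become ops in an explicit graph.
from dagster import job, op

@op
def ingest_sequencing_run():
    return ["ACGTTTGCA", "TTAGGGACT"]          # stand-in for raw reads

@op
def demultiplex_samples(reads):
    return {"specimen_01": reads}              # stand-in for per-sample buckets

@op
def quality_control(samples):
    return {name: "analysis_ready" for name in samples}

@job
def single_cell_processing():
    quality_control(demultiplex_samples(ingest_sequencing_run()))
```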
[00:39:18] Unknown:
As you continue to build out the platform, work with your stakeholders, and evolve the platform, I'm curious what are some of the things that you have planned for the future of Immunai, or some of the ways that the evolution of the data ecosystem and the bioinformatics space are
[00:39:35] Unknown:
influencing the way that you think about this problem or introducing new tools that you're excited to explore and possibly incorporate into the work that you're doing? One of the nice things at Immunai, and I'm not just saying this as a slogan, that's really how we operate, is that we're operating in a cutting-edge field, and we have to experiment with cutting-edge technology. We work with basically any new innovation that is coming either from academia or from industry, or, obviously, homegrown as well, because we're still in the discovery phase of what we can do with this type of data.
And going forward, the future of Immunai is really about generating more data and integrating public data. We recently acquired a Swiss company that is aggregating all the single cell data that is available in the public domain. What they're able to do is ingest studies that are being done at research institutions and published, curate them, and aggregate them into the largest public data repository for single cell data. So we're in the process of integrating this repository into our operations, so our data community can actually tap into this resource and use it and combine it together with the proprietary data that we generate in house.
So, yeah. The future is basically more data, more public data, more sources, new and different types of assays, different types of experiments that we're running, probably many new algorithms that we're going to be adding for denoising and increasing the quality of our data, and definitely more
[00:41:11] Unknown:
machine learning and data modeling being done to generate more insights from this data as well. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:41:31] Unknown:
I think that bioinformatics tooling is still very much lacking attention from the data engineering community. There hasn't been a whole lot of attention devoted to the needs of the genomics community. I know that the perception is that there's a lot being done, but speaking from the inside, I can tell you that we're still adapting general-purpose tools to our specific needs.
[00:41:55] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Immunai. It's definitely a very interesting company and an interesting problem space, and I think that you've built a very impressive engineering capacity to support it. So I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. Thanks for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Guy Yachdav's Background in Data Science and Genomics
Overview of Immunai and Its Mission
Types of Biological Data and Challenges
Users and Use Cases of Immunai's Data
Engineering and Domain Expertise at Immunai
Open Source Tools and Community in Bioinformatics
Platform Architecture and Data Processing
Data Timeliness and Quality Control
Collaboration and Communication with Researchers
Innovative Solutions and Challenges
Future Plans and Industry Trends
Closing Remarks