Summary
The life sciences industry has seen incredible growth in scale and sophistication, along with advances in the data technology that make it possible to analyze massive amounts of genomic information. In this episode Guy Yachdav, director of software engineering at Immunai, shares the complexities that are inherent to managing data workflows for bioinformatics. He also explains how he has architected the systems that ingest, process, and distribute the data that he is responsible for, and the requirements that are introduced when collaborating with researchers, domain experts, and machine learning developers.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- Your host is Tobias Macey and today I’m interviewing Guy Yachdav, Director of Software Engineering at Immunai, about his work at Immunai to wrangle biological data for advancing research into the human immune system.
Interview
- Introduction (see Guy’s bio below)
- How did you get involved in the area of data management?
- Can you describe what Immunai is and the story behind it?
- What are some of the categories of information that you are working with?
- What kinds of insights are you trying to power/questions that you are trying to answer with that data?
- Who are the stakeholders that you are working with and how does that influence your approach to the integration/transformation/presentation of the data?
- What are some of the challenges unique to the biological data domain that you have had to address?
- What are some of the limitations in the off-the-shelf tools when applied to biological data?
- How have you approached the selection of tools/techniques/technologies to make your work maintainable for your engineers and accessible for your end users?
- Can you describe the platform architecture that you are using to support your stakeholders?
- What are some of the constraints or requirements (e.g. regulatory, security, etc.) that you need to account for in the design?
- What are some of the ways that you make your data accessible to AI/ML engineers?
- What are the most interesting, innovative, or unexpected ways that you have seen Immunai used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working at Immunai?
- What do you have planned for the future of the Immunai data platform?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
RudderStack provides all your customer data pipelines in one platform. Easily sync data in and out of your warehouse, data lake, product tools, and marketing systems. State-of-the-art Event Stream, ETL, and Reverse ETL enable powerful use cases, and flexible APIs allow you to integrate with your existing workflow. Sign up free, or just get the free t-shirt for being a listener of the Data Engineering Podcast, at dataengineeringpodcast.com/rudder. Your host is Tobias Macey. And today, I'm interviewing Guy Yachdav, director of software engineering at Immunai, about his work at Immunai to wrangle biological data for advancing research into the human immune system. So Guy, can you start by introducing yourself?
[00:01:40] Unknown:
Sure. And thanks for having me, Tobias. I'm Guy, director of software engineering here at Immunai. I've been doing data science and data engineering for about 20 years now. And do you remember how you first got involved in the area of data? I got into data and genomics around the year 2000. Those were really the first days of the Human Genome Project. The data coming out of the project had just been released, and everyone in the community was very excited about being able to access this data and analyze it. I heard from a friend about this new field of bioinformatics, which is basically the field of science in which you analyze genomics data and build software to be able to both analyze it and make sense of it. I got drawn into it, so I joined the Columbia Genome Center.
And basically back then we started building everything from the ground up. These were still the days when there weren't a lot of off-the-shelf tools. We pretty much had to build everything, from the server racks all the way up to adapting the operating system to be able to wrangle the data that we were working with. We built our own scheduler. We built our own orchestration system. We built our own caching system, our own data lake, and, in a way, our own warehousing solution. Everything was kind of self-made, and we figured things out as we went.
[00:03:06] Unknown:
And so in terms of the Immunai platform, can you give a bit of an overview of what it is that you're building there, some of the story behind that, and some of the specific challenges that are posed by biological data when you want to use it in this sort of big data machine learning context?
[00:03:25] Unknown:
Let me give a bit of an overview of what Immunai is actually doing and the idea behind it, and maybe, in the broader scope, talk a little bit about the biopharma industry and where we are right now. I think it's not a big secret that the era of blockbuster drugs is kind of behind us, and Immunai is built around the idea that the next major breakthrough in drug discovery is going to come from data mining, the ability to take big data and find relevant and interesting biological insight. Specifically, the idea of looking into what is called single cell data is relatively new, so I should probably explain what single cell data is. If you were to take a microscope to the immune system and really get down to the gene level, you would be able to actually profile the current state of a cell, whether the cell is in what is called its normal form or in an abnormal state.
That would give you an indication of whether or not the cell is responding to a current treatment, whether or not the cell is acting abnormally given a certain indication, a certain disease. And if you look into this map of cells, you start to get a better idea of what is the reason for a certain disease, or what is the reason that a certain patient is responding or not responding to the treatment they're receiving. So Immunai is really built around the idea that using single cell genomics gives us a much better lens into the biology and into the mechanisms that are behind diseases and behind the effectiveness of the treatments that we use.
And this whole notion is basically built around big data. The idea behind Immunai is that we are not only analyzing data, we ourselves are actually generating the data. We are completely vertically integrated, meaning that we have our own facilities to generate the data through genomic sequencing. We have our own capabilities to fine tune the way that we generate this data so we can very precisely and accurately
[00:05:40] Unknown:
analyze this data. In terms of the types of information and the formats that you're working with, I'm curious if you can give some detail there, where a lot of people might be familiar with working with tabular structures or text arrays or things of that nature. And I know that with biological formats, there are usually going to be incredibly long sequences of text for things like genomic sequences, or there are custom data formats to be able to encode some of the information about things like DNA or RNA sequences or some of these other specialized data types. I'm curious if you can just talk to some of the types of information that Immunai is working with and some of the specialized capabilities that you've had to build out to account for them. Yeah. So that's exactly right. I mean, we start at the very rawest form. We're looking at DNA sequencing, which is basically
[00:06:31] Unknown:
just very long strings made out of A, C, G, and T. And that's kind of the data that we get out of the measuring instrumentation at the sequencing lab. And unlike other analytics operations, we have a lot of work to do in order to get to the point where we can actually start making sense of this data. There are a lot of constraints that the experimental side, the lab side, is required to work within, and that translates into the way that the data is set up. For instance, one of the characteristics of single cell data is that we are processing samples, we're processing biospecimens. And in order to save costs, what we do is mix those biospecimens together and run them through the sequencing machine.
When this comes out as data from the sequencing machines, we actually have to go and separate those DNAs back into the different biospecimens. That obviously requires a whole battery of algorithms that do the job and actually sort the biospecimens. There's a lot of noise, obviously, that is added in this process, and we have to denoise it. And only after we separate the biospecimens, denoise the data, clean it out, and sort out the data that is up to the analytic quality that we need, only then can we actually start looking at and making sense of this data. In terms of scale, we're talking about close to a terabyte of data from the studies that we look at. It's quite a lot of data that we go through. And in terms of variety, while the raw data itself is pretty homogeneous, as I said, we're only talking about DNA and RNA sequencing,
when we're actually looking at trying to make sense of this data, that's the point where heterogeneous data comes in. We're talking about public and private databases that we're looking at. We're talking about imaging data that is involved. And all of that needs to be integrated together, which obviously also brings a host of issues that we can discuss a bit later. But all in all, in terms of formatting, we start off at a very broad level, we make sense of this data, and we end up with what is very much accepted and practiced in our field, the Seurat object. This is basically a gigantic dictionary that happens to be encoded using the R language and is accessible through a specialized R library.
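A minimal, purely illustrative sketch of the demultiplexing idea described above: pooled reads are separated back into per-biospecimen buckets by a sample barcode. The barcode map and read layout here are hypothetical; real pipelines use dedicated, error-tolerant tools rather than exact prefix matching.

```python
# Toy demultiplexing: group pooled reads by the biospecimen their leading
# barcode points to. Barcodes, sample names, and read format are made up.
from collections import defaultdict

BARCODE_TO_SAMPLE = {          # hypothetical barcode -> biospecimen mapping
    "ACGT": "specimen_01",
    "TTAG": "specimen_02",
    "GGCA": "specimen_03",
}
BARCODE_LEN = 4

def demultiplex(reads):
    """Group raw reads by biospecimen; unknown barcodes go to 'undetermined'."""
    buckets = defaultdict(list)
    for read in reads:
        barcode, sequence = read[:BARCODE_LEN], read[BARCODE_LEN:]
        sample = BARCODE_TO_SAMPLE.get(barcode, "undetermined")
        buckets[sample].append(sequence)
    return buckets

if __name__ == "__main__":
    pooled = ["ACGTTTGCA", "TTAGGGACT", "GGCAACGTA", "NNNNACGTA"]
    for sample, seqs in demultiplex(pooled).items():
        print(sample, len(seqs))
```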
[00:09:09] Unknown:
As to the types of questions that you're trying to answer from this data, I'm curious if you can give some examples of the ways that you need to be able to manipulate this raw information to be able to answer these questions and some of the derivations that you're looking to be able to provide to some of the downstream consumers and maybe talk a bit about who the consumers are of this processed
[00:09:37] Unknown:
data. I'm going to talk about three main users that we have. Two of them are more on the biology side of things, and then we have the machine learning engineers that are looking into the data. We'll start with the immunologists. They're very much concerned with high-level questions about the biology. For instance, they want to design an experiment that would help them figure out which drugs they can match to a certain disease, like which drug would be the most effective on a specific patient with a specific genome profile. And what immunologists are interested in is looking into large datasets and finding the short list of those drugs that could destroy a certain disease. And for that, I think that a fairly simple data repository does the trick there. Where it becomes a little bit more complicated is when you're actually talking to a different class of user. These are the computational biologists, on the algorithmic and big data side.
And computational biologists are very much interested in performing deeper analysis, and they can ask questions such as: in which disease can I find a gene that is overexpressed, where we find its RNA sequences at a larger amount than normal? And in order to be able to perform such an analysis, obviously, they need to draw large amounts of data and start comparing between different datasets. And for that, obviously, we need our data warehousing and downstream analysis to be in place. And for the third class, our machine learning engineers, the use case there is a little bit different.
They're less interested in looking at specific studies; they're looking at cross-study, cross-project, cross-dataset analysis, where they actually use our data mart to filter and generate specific datasets so they can answer questions such as: how do I assign a specific cell type label to a cell that I'm seeing in a sample? Or they would look at our data and try to predict what would happen if I knocked out a gene, if I changed a specific gene in a specific location. What would happen to the cell? Would it impact it in a way that can actually trigger an immune response or not? And for that, as I mentioned, they require large data. That's part of the reason why Immunai was established and decided to cover the entire gamut from sample and sequence generation all the way up to analysis.
So we could actually control the process of generating those datasets.
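As a rough illustration of the cell type labeling task the ML engineers tackle, here is a minimal scikit-learn sketch on synthetic expression vectors. The features, labels, and model are placeholders for illustration, not Immunai's actual approach.

```python
# Sketch: learn to assign a cell type label from gene expression.
# Data is synthetic; "T cell"/"B cell" labels and the signal genes are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_cells, n_genes = 600, 50

# Fake expression matrix: two cell types with shifted expression in a few genes.
X = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)
y = rng.integers(0, 2, size=n_cells)      # 0 = "T cell", 1 = "B cell" (illustrative)
X[y == 1, :5] += 3.0                      # plant a type-specific signal

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```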
[00:12:30] Unknown:
As far as being able to manage the engineering around working with all of these datasets, processing them, and providing an appropriate representation to these different stakeholders, I imagine that requires a certain amount of domain expertise among the engineering group to be able to know, you know, this transformation is valid given this context, but if I happen to manipulate this source format in such a way that these sequences of letters get transposed, then it's going to be complete garbage. And so I'm curious how you've approached that aspect of bringing in the appropriate domain expertise without necessarily having to have everybody on your team be an expert in immunology or single cell genomics?
[00:13:17] Unknown:
That's an excellent question. As I mentioned earlier, there are two major types of biologists within the team. We've got the immunologists, who are domain experts on the immune system. And we've got the computational biologists, who are domain experts that basically bridge the gap between the biology side and working with algorithms and big data. And then we have the software engineers, who basically build the whole data infrastructure and put it in place. And these three classes of experts obviously need to communicate and work together to build up the tools that we use at Immunai.
And in order to do that, you have to maintain a certain degree of integration between the different disciplines while keeping a good enough separation that allows each of those disciplines to operate independently. The way that we did this, and I think I have a good example to demonstrate it, is within our data processing pipeline, which combines sequencing data that needs to be parsed and eventually analyzed by immunologists, and also needs to be processed by algorithms that the computational biologists are developing.
It also needs to be assessed for quality by the computational biologists who are looking into this data, and then the infrastructure needs to be built by software engineers, as I mentioned earlier. To that effect, we first created a team that is very heterogeneous, a team composed of all three classes of experts. But on the technical level, we're using tools that allow us a certain degree of separation between, for instance, the orchestration layer and the algorithmic and analysis layers. What is interesting, by the way, is that while the infrastructure itself is built in Python, all of the algorithmic and analysis layers at Immunai are built using the R language, which is specifically suited for analyzing single cell data. This in itself obviously poses a challenge for building big data analysis systems.
The toll of actually standing up a large-scale, robust data system is pretty obvious there. But we've adapted to this situation and built really nice solutions that keep track of, for instance, versions and compatibility between the two different languages. We've built nice bridges between the Python and R sides. We've actually set up the software development cycle such that computational biologists can go in and develop their own specific R algorithm without being concerned with everything that has to do with the orchestration, the monitoring, the logging, and everything else that has to do with running the analytical piece.
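The Python-to-R bridge described here is custom to Immunai, but the general handoff pattern can be sketched with the open source rpy2 package. Treat this as an illustration under that assumption; the R function below is made up.

```python
# Sketch of a Python -> R handoff using rpy2 (illustrative, not Immunai's broker).
import rpy2.robjects as ro

# Define a toy "analysis" function on the R side.
ro.r("""
normalize_counts <- function(x) {
  # log-normalize a numeric vector of counts
  log1p(x / sum(x) * 1e4)
}
""")

counts = ro.FloatVector([0, 3, 10, 1, 250])   # toy per-gene counts
result = ro.r["normalize_counts"](counts)     # call into R from Python
print(list(result))
```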
[00:16:02] Unknown:
For that communication at the language level, are you able to take advantage of the Arrow format for being able to hand off the in-memory data structures? Or are you doing it where you're maybe serializing over the wire to send information to the R runtime and then shuttling it back to the Python
[00:16:26] Unknown:
layer? Yeah. That's exactly right. We have basically a broker library that's shuttling data back and forth between Python and R. We're also using a mechanism that allows us to propagate any feedback coming from R back to Python, and we can actually show it on our orchestrator.
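One concrete way to hand a data frame across the Python/R boundary, in the spirit of the Arrow discussion above, is Arrow's Feather format: written with pyarrow on the Python side and readable with the `arrow` package on the R side. The columns and file path below are illustrative only.

```python
# Write a data frame in Arrow's Feather format for an R process to pick up.
import pandas as pd
from pyarrow import feather

cells = pd.DataFrame({
    "cell_id": ["c1", "c2", "c3"],
    "cell_type": ["T cell", "B cell", "T cell"],
    "n_genes_detected": [1200, 950, 1430],
})

feather.write_feather(cells, "cells.feather")   # Python side
# R side (for reference):  df <- arrow::read_feather("cells.feather")
```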
[00:16:44] Unknown:
As far as the challenges that are unique to biological data, what are some interesting examples of complexities that you've had to deal with because of a lack of generalized support in the tooling? Where if you're working with clickstream data, you can easily just put it into Kafka and process it with Spark or something, but if you're working with this genomics data, you might not have as much in the open source ecosystem to pull from. I'm curious how that has manifested as far as where you've invested your engineering time and how that boils down in the build versus buy decision.
[00:17:27] Unknown:
I'll preface and say that, because of my previous life at the Columbia Genome Center, working with big data and inventing, or reinventing, the wheel back then, we've seen a lot of the tools, or types of tools, that we use today for streaming and orchestration actually being built at genomics labs, just because there wasn't anything on the shelf back then. So a lot of those solutions, or at least some of them, can trace their origins to the genomics labs. Unfortunately, the reverse did not happen, so we didn't have the general-purpose tools becoming good solutions for biology.
We do know that many vendors have tried to use genomics to present use cases in which they're showing the capability and dexterity of, for instance, their cloud services or their warehousing capabilities. But we haven't seen something that is really tailor-made for genomics and actually able to encompass the needs of genomics. And therefore, there's a lot of effort on our side to cover this gap. I can give one example, for instance, of a core algorithm in our processing pipeline where we're feeding in a set of raw sequencing data, and based on this DNA sequence we're trying to get the label of the gene; we're trying to understand which gene we're looking at. The algorithm itself is based on clustering, in which we're looking at this long genomic sequence and, using clustering algorithms, we identify the closest match, and from there we get the gene name. However, it does a whole lot more than just labeling. As I said, there's a lot of data cleaning and a lot of gymnastics being done with this data. And this entire algorithm is being developed and maintained by a single vendor.
And that specific vendor has a sequencing operation as well. Obviously, the software does play a big role in their offering, but it's not their major or core business. So while they are putting a lot of attention into developing this tool, it still lags and is not as advanced and capable as similar solutions you would find in other verticals. To add on to this, there are not too many competitors in the market for this specific capability, which doesn't leave a whole lot of room for deciding which algorithm you're going to use. You're kind of locked into this specific algorithm because you're using the sequencing format that this specific vendor is using, and because there aren't a whole lot of other options in the market. So the mitigation there is that we either build extensions around it,
if we do get access to the source, or we build some kind of wrappers around it to add to the capabilities, and we try to get the most out of it. One of the interesting things is that it keeps the team much more challenged, and they come up with more creative solutions than they would with the usual tools. So the team is less spoiled.
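The wrapper approach mentioned above can be sketched roughly as follows. The vendor tool, its flags, and its output file are hypothetical placeholders, since the actual vendor interface is not named in the episode.

```python
# General "wrapper" pattern: run the vendor's black-box tool, fail loudly if it
# errors, then layer our own checks/extensions on its output. The command name,
# flags, and output artifact below are invented for illustration.
import subprocess
from pathlib import Path

def run_vendor_counter(fastq_dir: Path, out_dir: Path) -> Path:
    cmd = ["vendor-count", "--input", str(fastq_dir), "--output", str(out_dir)]
    completed = subprocess.run(cmd, capture_output=True, text=True)
    if completed.returncode != 0:
        raise RuntimeError(f"vendor tool failed:\n{completed.stderr}")

    matrix = out_dir / "gene_counts.tsv"        # hypothetical output artifact
    if not matrix.exists():
        raise FileNotFoundError(f"expected output missing: {matrix}")
    return matrix
```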
[00:20:32] Unknown:
Today's episode is sponsored by Prophecy.io, the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all data users can use software engineering best practices: Git, tests, and continuous deployment, with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage.
Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. And I know that the overall field of bioinformatics has been growing a lot in recent years and that a lot of that has happened in the open source ecosystem. I'm curious if there have been any kind of emergent standards that have grown out of that growth, some of the ways that your experience at Immunai maps to the ways that the broader community is starting to approach these problems, some of the ideas that you have benefited from, and some of the ways that you think the broader community might be able to improve, or ways that you might be able to refactor your overall approach to take advantage of some of these community-supported tools or platforms?
[00:22:12] Unknown:
You're absolutely right. There's a whole host of open source software and open source tooling on offer, and we've looked into those tools not only at Immunai but also back at the Columbia Genome Center. Some of those tools are actually the bedrock of the science itself. For instance, as I mentioned earlier, the analytics object that we're working on, which is named Seurat, is the main object that represents single cell data within our field. There's an open source library that allows you to generate, access, and modify it, and there's a big community built around it. Obviously, it comes with its weaknesses. Because it was developed in R, there's a whole host of issues around how we productionize doing analytics over this object.
I should add that at Immunai, one of our main occupations is actually industrializing single cell data. So we don't just look at one study; we have a well-functioning, well-oiled machine to generate and churn out single cell data at scales that are not being reached almost anywhere else. And to do that, obviously, we need a lot of automation, and we need to solve a lot of problems that researchers in the community are not really facing, because they're working with one Seurat object; they're working with one study. Whereas we work with many, which obviously brings in the question of scale and heightens the questions of variety of data and of managing this data. And in that sense, it's basically on us to start building the tools around it. In that sense also, those solutions are not available in the open source community.
[00:23:54] Unknown:
Another aspect of what you're doing is, as you mentioned, you have these multiple domain experts within your team, and then you're also interfacing with people who are doing active research, people who are trying to do analytical applications, people who are trying to build machine learning models. And I'm wondering if you can talk to some of the ways that you've architected your overall platform for being able to service all of these different use cases and some of the ways that you're able to converge in terms of the data outputs that can provide the utility to all these end users and some of the ways that you've had to specialize some of the output to be able to fit those different use cases?
[00:24:37] Unknown:
Let me say a couple of words about the architecture of the platform in general. There are, I guess, three or four main areas to the platform. As we mentioned earlier, we have the area of data processing. Then we have the area of data cataloging, which I'll touch on in a second, then we have the warehousing, and on top of all of this, we have the analytical applications. This conceptual separation already allows us to build the platform and make it such that it can serve the different use cases. However, many complexities emanate from having so many different disciplines and so many use cases that are trying to look at the data and investigate it from many different angles.
So I mentioned earlier that we have those two classes of biologists, the immunologists and the computational biologists, but I kind of oversimplified. Obviously, within the immunologist community, we have different types of immunologists asking fairly different questions. And within the computational biologists, we have different types of computational biologists looking to do different things, and those present different use cases. In order to be able to address those different use cases, we've converged on the minimal set of data that we need to process in a single way, assess its quality level, and bring it to a quality level
where we can continue and analyze it. And that kind of provides the core of our data repository. This is what we call the Annotated Multi-omic Immune Cell Atlas, or AMICA for short. That's our main data repository, and it is the sum of the data that has been aggregated at Immunai and assessed to reach a certain quality level so that it can be consumed for analytics and other use cases. The core of AMICA represents the data that we generate at Immunai. However, we can enrich this data from public sources and other databases to help the analytics process.
And this requires a whole layer of data integration that requires another domain expert: the bio-curators. These are experts who can help us figure out questions such as standardization around gene names. While there are some standards for naming genes in central databases, those standards obviously cannot be enforced, and therefore there can be many different aliases and synonyms for this unit of analysis called the gene. For clinical data, we can process samples that are defined by the clinical experts in one way and could be defined in a different way by other experts, depending on the disease area or the treatments that are being administered.
And for that, in order to actually meld everything together and bring it to the point where it can be analyzed, we need the help of the bio-curators, who build ontologies that help map this data to one standard that we maintain at Immunai. So by taking AMICA and the standardized data that the bio-curators bring in, we can expose the data and make it available to the different use cases. There are many different requirements for the way that we actually do computation. As I mentioned earlier, we can serve a very simple use case of doing a simple search over the data repository; that's basically just picking out and filtering some metadata terms, matching them up against the data warehouse, and doing some aggregate calculation on top of that.
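A toy version of the gene name standardization the bio-curators handle, mapping aliases onto one canonical symbol. The mapping table is a tiny illustrative sample, not Immunai's curated ontology.

```python
# Normalize gene aliases to a curated canonical symbol (illustrative table).
GENE_ALIASES = {
    "CD20": "MS4A1",
    "PD-1": "PDCD1",
    "PD1":  "PDCD1",
    "HER2": "ERBB2",
}

def canonical_gene(symbol: str) -> str:
    """Return the curated canonical symbol, falling back to the input."""
    key = symbol.strip().upper()
    return GENE_ALIASES.get(key, key)

assert canonical_gene("pd-1") == "PDCD1"
assert canonical_gene("MS4A1") == "MS4A1"   # already canonical
```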
There's the more complicated part where we actually start to look into large amounts of what we call gene expression, which is basically the amount of RNA sequences that we find in a specific cell. And if you want to start comparing between the gene expression of one study versus the gene expression of another study, or maybe we have one study with different conditions being administered within it and we want to compare between the two, then we are required to do heavy computation on top of our data warehouse, which is basically a two-step process.
We have to do some heavy lifting and filtering on the data warehouse and then feed that over to the downstream analytics computation, which also happens to run in R. That computation is done on the data frame that is exported from the data warehouse.
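As a sketch of the comparison step described here: once the warehouse has done the heavy filtering, a per-gene comparison between two studies or conditions can be as simple as a statistical test. The data below are synthetic, and the real downstream analytics in R are much richer.

```python
# Compare expression of one gene between two studies with a rank-based test.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
expr_study_a = rng.negative_binomial(5, 0.5, size=300)   # per-cell counts, gene X
expr_study_b = rng.negative_binomial(5, 0.4, size=300)   # slightly higher expression

stat, p_value = mannwhitneyu(expr_study_a, expr_study_b, alternative="two-sided")
print(f"Mann-Whitney U p-value for gene X: {p_value:.3g}")
```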
[00:29:23] Unknown:
As far as the requirements that you have in terms of timeliness or latency, I'm curious what you're working with there, the frequency of updates that you're getting from your source systems, some of the latency or timeliness constraints that you need to consider, and the end-to-end flow from collecting the source data through to delivering it to somebody who wants to process it in their machine learning model, build some analysis, or do some ad hoc research on it? We're a batch processing operation,
[00:30:00] Unknown:
and we're working with sequencing data that can take up to 10 days on the sequencing side. Where the data engineering effort comes into play, and where it actually adds a lot of value, is by creating this automation: automation that takes this raw data and churns out the best quality data that we can get out of the raw sequencing. In that sense, while the time intervals that we need to work with need to be within a couple of days' turnover, we're not really required to have data available instantaneously.
[00:30:33] Unknown:
Because of the sequencing process and the overall time frame of being able to generate new datasets, I'm curious how that factors into the work that you're doing on the data engineering side. Is there anything that you've had to do with your data providers to say, you know, this is the optimal output format that you can produce that makes it easier for me to consume it? Or some of the quality checks that you're able to execute early in the process, or whether you're able to get any sort of intermediate results from that sequencing to be able to understand, you know, this is a good run, we're going to let it go through to completion, or there are some errors happening in the data generation, so we're going to abort and start over so that we don't have to waste the entire 10 day cycle?
[00:31:22] Unknown:
Right. So QC, quality control, is an integral part of our processing operation, and we never throw out complete batches of data, but we can mark a specific chunk of the data at a specific quality level, where we may decide not to run analytics on it or it may not be good enough for a specific study. What is very interesting is that since we design and control our own experiments, and we can actually measure the quality of the data on the computational side, we can always inform the experimental side either on how we can improve for the next time, or, if a certain sample was not processed up to quality, we can always restart with a new configuration and improve the quality.
So there's a very interesting cycle of actually being able to control both sides, both the experimental and the computational side, and making sure that the experimental side is very much data-informed. It's not that we just run the experiment, are done with it, and let somebody else analyze the data. We have the full circle there, where we can actually run, process, and then inform whether or not we want to improve the quality.
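A minimal sketch of the quality-tiering idea described above: each sample gets a quality level rather than a whole batch being discarded. The metrics and thresholds are invented for illustration.

```python
# Assign each sample a QC tier that downstream analytics can filter on, and
# that can trigger re-sequencing. Metric names and cutoffs are made up.
def qc_tier(metrics: dict) -> str:
    if metrics["pct_reads_mapped"] >= 0.9 and metrics["median_genes_per_cell"] >= 800:
        return "analysis_ready"
    if metrics["pct_reads_mapped"] >= 0.7:
        return "usable_with_caution"
    return "flag_for_resequencing"

samples = {
    "specimen_01": {"pct_reads_mapped": 0.94, "median_genes_per_cell": 1100},
    "specimen_02": {"pct_reads_mapped": 0.62, "median_genes_per_cell": 300},
}
for name, metrics in samples.items():
    print(name, "->", qc_tier(metrics))
```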
[00:32:35] Unknown:
As far as the overall life cycle of data and some of the ways that the kind of data generation process happens. You mentioned that it's data informed, and so you have researchers who are analyzing the outputs of previous runs, and then they decide what is the next batch of data that I want to generate and analyze. And I'm just curious if you can talk to some of the overall kind of collaboration and communication aspects of what you've built to work with these researchers to help them understand, you know, this is an area that's worth exploring more or how they might talk to you to understand what are some of the ways that I should be thinking about structuring this experiment to be able to get the most useful data out of it and just that overall workflow and life cycle?
[00:33:24] Unknown:
We get a pretty good signal. For instance, when we do the quality control analysis, we get a pretty good signal as to where on the experimental side we could have done things differently to get better quality. But not just on the experimental side; we can also look at what the ML folks are getting from us. So for instance, one of the denoising efforts that we do at Immunai is what is called removing doublets. Sometimes in the sequencing process you can see, and count, the same cell twice. That is a very common phenomenon in single cell sequencing.
To compensate and correct for that, we're using a set of analytics, including machine learning algorithms, that identify those cells that are too similar and that we're counting twice. One of the nice things that we can do at Immunai is actually generate datasets for our machine learning experts in which we capture those doublets, those cells that we notice, and label them as doublets. And then the machine learning experts can train and improve the algorithms that identify those doublets. In turn, we're getting a much better process of actually cleaning out and denoising this data.
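A heavily simplified illustration of flagging candidate doublets so they can be labeled as training data. Real doublet detection relies on dedicated methods (tools such as Scrublet exist for this), and the similarity cutoff here is arbitrary.

```python
# Flag cells whose expression profiles are nearly identical as doublet
# candidates; those labels can then feed an ML training set. Data is synthetic.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(2)
expr = rng.poisson(1.0, size=(50, 200)).astype(float)   # 50 cells x 200 genes
expr[1] = expr[0] + rng.poisson(0.2, size=200)          # plant a near-duplicate pair

sim = cosine_similarity(expr)
np.fill_diagonal(sim, 0.0)
is_doublet_candidate = (sim > 0.95).any(axis=1)         # arbitrary cutoff
print("flagged cells:", np.where(is_doublet_candidate)[0])
```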
[00:34:44] Unknown:
In terms of the ways that the data that you're generating is being applied, some of the workflows that you've built at Immunai, or some of the specific technical complexities you've had to deal with, I'm just wondering what are some of the most interesting or innovative or unexpected applications of your platform that you've seen, or solutions that you have seen built internally to address this problem space?
[00:35:09] Unknown:
There are two that I think I can mention here that would be interesting. On the engineering side, being able to harmonize, to make sure that both the production code, which is written in Python, and the research code, which is written in R, coexist in our ecosystem, has required us to be very innovative not only in the way that we engineer our systems and their architecture, but also on the DevOps side. One solution that we came up with, and maybe I should first describe the problem, is that when you're building a large data engineering system that evolves over time, and an integral part of the system is based on R, you run into a lot of compatibility issues. R is a very nice language
if you have it contained to your own workstation and you are basically working on analysis tasks. But when you're trying to stand up a large system that continuously evolves, then R is a bit lacking, because R does not really manage versions in the way that we're used to managing versions. We built a pretty innovative solution around that using a date-snapshot system, in which we reach out to repositories that keep track of the different versions of different R packages as of a specific date. And we have to match that with the state of the operating system and the system libraries that support it.
Together, combining this configuration, the fact that we are able to pick out a specific R library from a specific date along with the supporting system libraries and system packages, we get a solution that is guaranteed to work at any given time and be compatible with other libraries. This has been an innovative solution that we came up with out of necessity. That's on the engineering and the DevOps level.
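A rough sketch, not Immunai's actual tooling, of driving dated-snapshot R package installs from Python-side build scripts. The snapshot URL format follows Posit Package Manager conventions and should be treated as an assumption.

```python
# Pin R package versions by installing from a CRAN repository frozen at a date.
# The default repo URL is an assumption; pass your own snapshot base if needed.
import subprocess

def install_r_packages(packages, snapshot_date,
                       repo_base="https://packagemanager.posit.co/cran"):
    repo = f"{repo_base}/{snapshot_date}"              # e.g. .../cran/2022-06-01
    pkgs = ", ".join(f'"{p}"' for p in packages)
    r_code = f'install.packages(c({pkgs}), repos = "{repo}")'
    subprocess.run(["Rscript", "-e", r_code], check=True)

# Example (requires R on the PATH):
# install_r_packages(["Seurat", "arrow"], "2022-06-01")
```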
On the ML level, something that we do that is very unique to Immunai is the ability to project from the large-scale data that we generate, from this larger distribution of cells, onto rarer examples. Obviously, the common cell types are easier to profile, but for those cell types that are quite rare, it's harder to actually profile them. We use advanced machine learning to be able to predict how those cells would behave under specific conditions.
[00:37:38] Unknown:
And in your own experience of building out this system and working in this problem space, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:37:47] Unknown:
The industrialization of single cell genomics has posed many challenges for us. I mentioned the fact that we had to make the research language, R, coexist with the production language, Python in that sense. We had to take different disciplines and have them work together in order to build those solutions, and we had to make sure that they're all communicating in the same language and that they're all able to see and understand the computational processes that we're running. In that sense, I did not mention it yet, but the orchestration tool that we are using is Dagster, developed by Elementl. One of the reasons that we decided to use Dagster as a core platform in our stack is that we really like its capabilities.
Dagster, through its very nice UI, is able to make the architecture, the process, and the workflow of data processing very explicit, not only for the software engineers but also for the computational biologists, the immunologists, and the ML people, so we can all huddle around that processing workflow together, make tweaks to it, and decide how we want to run a specific dataset or how we want to tweak the different parameters in the way that we generate the features for our next ML model. Making it explicit and putting this architecture in front of all of the constituents, in front of all of the experts, has made a significant improvement in the way that we communicate and collaborate at Immunai.
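A minimal Dagster sketch of the kind of explicit, shared workflow described here: each step is an op, the job wires them together, and the resulting graph is visible in Dagster's UI. The ops and their names are illustrative stand-ins, not Immunai's actual pipeline.

```python
# Minimal Dagster job: the processing steps become ops in an explicit graph.
from dagster import job, op

@op
def ingest_sequencing_run():
    return ["ACGTTTGCA", "TTAGGGACT"]          # stand-in for raw reads

@op
def demultiplex_samples(reads):
    return {"specimen_01": reads}              # stand-in for per-sample buckets

@op
def quality_control(samples):
    return {name: "analysis_ready" for name in samples}

@job
def single_cell_processing():
    quality_control(demultiplex_samples(ingest_sequencing_run()))
```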
[00:39:18] Unknown:
As you continue to build out the platform, work with your stakeholders, and evolve the platform, I'm curious what are some of the things that you have planned for the future of Immunai, or some of the ways that the evolution of the data ecosystem and the bioinformatics space are
[00:39:35] Unknown:
influencing the way that you think about this problem or introducing new tools that you're excited to explore and possibly incorporate into the work that you're doing? One of the nice things at Immunai, and I'm not just saying this as a slogan, that's really how we operate, is that we're operating in a cutting-edge field, and we have to experiment with cutting-edge technology. We work with basically any new innovation that is coming either from academia or from industry, or, obviously, homegrown as well, because we're still in the discovery phase of what we can do with this type of data.
And going forward, the future of Immunai is really about generating more data and integrating public data. We recently acquired a Swiss company that is aggregating all the single cell data that is available in the public domain. What they're able to do is ingest studies that are being done at research institutions and published, curate them, and aggregate them into the largest public data repository for single cell data. So we're in the process of integrating this repository into our operations, so our data community can actually tap into this resource and use it and combine it together with the proprietary data that we generate in house.
So, yeah. The future is basically more data, more public data, more sources, new and different types of assays, different types of experiments that we're running, probably many new algorithms that we're going to be adding for denoising and increasing the quality of our data, and definitely more
[00:41:11] Unknown:
machine learning and data modeling being done to generate more insights from this data as well. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:41:31] Unknown:
I think that bioinformatics tooling is still very much lacking attention from the data engineering community. There hasn't been a whole lot of attention devoted to the needs of the genomics community. I know that the perception is that there's a lot being done, but speaking from the inside, I can tell you that we're still adapting general-purpose tools to our specific needs.
[00:41:55] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Immunai. It's definitely a very interesting company and an interesting problem space, and I think that you've built a very impressive engineering capacity to support it. So I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. Thanks for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Guy Yachdav's Background in Data Science and Genomics
Overview of Immunai and Its Mission
Types of Biological Data and Challenges
Users and Use Cases of Immunai's Data
Engineering and Domain Expertise at Immunai
Open Source Tools and Community in Bioinformatics
Platform Architecture and Data Processing
Data Timeliness and Quality Control
Collaboration and Communication with Researchers
Innovative Solutions and Challenges
Future Plans and Industry Trends
Closing Remarks