Summary
Biology has been gaining a lot of attention in recent years, even before the pandemic. As an outgrowth of that popularity, a new field has grown up that pairs statistics and computational analysis with scientific research, namely bioinformatics. This brings with it a unique set of challenges for data collection, data management, and analysis. In this episode Jillian Rowe shares her experience of working in the field and supporting teams of scientists and analysts with the data infrastructure that they need to get their work done. This is a fascinating exploration of the collaboration between data professionals and scientists.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
- Your host is Tobias Macey and today I’m interviewing Jillian Rowe about data engineering practices for bioinformatics projects
Interview
- Introduction
- How did you get involved in the area of data management?
- How did you get into the field of bioinformatics?
- Can you describe what is unique about data needs in bioinformatics?
- What are some of the problems that you have found yourself regularly solving for your clients?
- When building data engineering stacks for bioinformatics, what are the attributes that you are optimizing for? (e.g. speed, UX, scale, correctness, etc.)
- Can you describe a typical set of technologies that you implement when working on a new project?
- What kinds of systems do you need to integrate with?
- What are the data formats that are widely used for bioinformatics?
- What are some details that a data engineer would need to know to work effectively with those formats while preparing data for analysis?
- What amount of domain expertise is necessary for a data engineer to work in life sciences?
- What are the most interesting, innovative, or unexpected solutions that you have seen for manipulating bioinformatics data?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on bioinformatics projects?
- What are some of the industry/academic trends or upcoming technologies that you are tracking for bioinformatics?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Bioinformatics
- How Perl Saved The Human Genome Project
- Neo4J
- AWS Parallel Cluster
- Datashader
- R Shiny
- Plotly Dash
- Apache Parquet
- Dask
- HDF5
- Spark
- Superset
- FastQ file format
- BAM (Binary Alignment Map) File
- Variant Call Format (VCF)
- HIPAA
- DVC
- LakeFS
- BioThings API
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Jillian Rowe about data engineering practices for bioinformatics
[00:02:02] Unknown:
projects. So Jillian, can you start by introducing yourself? Sure. Hello, and thank you for having me on the show. So like you said, I'm Jillian Rowe. I work in bioinformatics. I spent about 10 years in academic research where I was mainly focused on in-house high performance compute clusters and optimizing analyses for those clusters. And a few years ago, I went freelance, and now I do much the same thing, except that I work with startups as they're working through their startup life cycle and developing their analyses and their own infrastructure, figuring out how they're gonna build out their high performance compute infrastructure, but on the cloud with AWS.
[00:02:41] Unknown:
Do you remember how you first got involved in the area of data management?
[00:02:44] Unknown:
I was in my first job, and, like, just kinda over time, I got more and more interested in kind of the programming side of things. So, actually, I started off as a data analyst. My background is actually first in computational neuroscience and then in bioinformatics. So when I graduated, you know, I went out and got a job as a data analyst as one does. And just over time, I found myself getting more and more interested in kinda the programming aspect and being able to automate certain tasks. So I would kinda run around and I would see, like, what people in the lab were doing. And then sometimes I would see them doing things very manually, and I was like, oh, I could write a macro in Excel or like a Perl script at that time to go and do that. And then kind of over time, people were coming to me for these sorts of tasks, which wound up being really interesting for me because as it turns out, I don't really care that much about the tech. I really just enjoy talking to people about their problems and what kind of problems they're going to solve, in particular with biology, just because I've always loved science and I've always loved biology.
So over time, I got more and more into that. And then at the time, I was working for Weill Cornell Medical College, and I was in their bioinformatics core. And this is when research and everything was really just starting up there. And one of the tasks that we were doing was we were trying to move over from having each one of the labs kind of having their own server environment that everybody managed themselves with sort of varying degrees of success to moving over to, like, a managed high performance compute cluster where you have, you know, much more robust, much more kind of centralized and managed resources. So for example, you can't just do apt-get install software. You have to compile everything from source. You put it in a system called modules. Now, you know, if you've used something like pyenv or conda, the conda package manager, you'll be quite familiar with that concept.
So I got kind of over time more and more into that. And then one of the things that came up, especially in bioinformatics, is that the resolution and the size of our data is just constantly increasing. Every few years, a new microscope or a new sequencer will come out. And each time that happens, the data, it's not even a linear curve. It's like an exponential curve in terms of how much the data increases. So essentially, what was happening was that every time there was a new sequencer, a new microscope, the scientists' analyses would stop working, not because, you know, the analysis itself was incorrect, but because, you know, they were running out of resources. They were hitting memory and CPU requirements, or they weren't able to run it in a time scale, you know, that was reasonable for them to go on to their next funding round. So I started to get really interested in this idea of, you know, you have your data and then you implement this kind of essentially like a scatter-gather sort of approach, and you do that with a high performance compute scheduler.
But, you know, of course, you know, you can't just start there. You have to go back and actually really get into the data, see how it was processed, how it came to be on the cluster, make sure it's all correct and all that sort of thing. So I was interested in doing these kinds of high performance compute analyses. And to do that, I had to go back and take care of some of the data management aspects as well. And you alluded to how you got involved in the field of bioinformatics as well, but I don't know if you want to speak a bit more to your sort of focus on that particular area and problem domain. I don't know. I've just always really loved science ever since I was a little kid. Like, I was a little girl who would run around pretending that I was Jane Goodall. You know? Like, I grew up in New England as we were discussing earlier, and I used to just love to run around outside and I would build myself little, like, terrariums out of soda bottles and I would, you know, like, put, like, caterpillars and sticks in there and make them an ecosystem for the day and only for the day because otherwise my mother would have been very upset if I brought bugs into the house. You know, that was summertime before we had the internet and that was what I used to do. And then when it was time to go to college, I was, like, well, you know, of course I'm gonna go study biology and so I started off studying biology but as it turns out, I hated the labs. Like, I hated them, and I was so bad at them. I was so terrible at them. And if I never have to do another titration as long as I live, like, I think I'm gonna die happy. So, you know, so I kinda got into that and then I was like, oh, what am I gonna do?
And there was another major, you know, luckily called computational neuroscience and not about robots, which I thought was equally cool. So I switched on over to computational neuroscience and that was really my first introduction to research and to academic research because I took a programming class. It was something, you know, like programming for research or, you know, what we would now call data science. And the teacher and the TA were both, you know, actual researchers and so they would spend a lot of the class kinda talking about that. Our TA did, like, very little TA-ing. It was more, you know, like, emotional support for the problems that he would run into building his robots, and he would show us, like, little videos of the robots. And so clearly, I just experienced that and was like, well, you know, this is what I wanna be doing. I wanna be in research. So I kinda carried on with that. And then I worked in computational neuroscience for a bit. I worked in some language of cognition labs. I worked with autistic children. I, you know, I was really into the cognition of language for a little bit. And then kind of as time went on, I was starting to think about grad school and I didn't really wanna do a PhD program. And I wound up finding what looked like a really interesting master's degree in bioinformatics, and I thought that would be kind of an interesting chance for me to get back to bioinformatics or back to biology rather. Because at the time when I'd switched to neuroscience, I didn't even know bioinformatics was, like, a thing. This was, you know, longer ago than I'm really willing to admit. So then, you know, bioinformatics became this cool new term and I got to go get started with that and I picked up some more programming courses and started to learn about really kind of the computational side of biology, which I had sort of never really known before, how DNA and genes can essentially be encoded as data and how we can take that data and analyze it. You can build, you know, phylogenetic trees to study evolution.
There's all kinds of cool things that you can do. "How Perl Saved the Human Genome Project." That's a really interesting article that I kinda recommend to everybody who wants to sort of know about, you know, maybe some of the history behind, you know, how we even sort of think about kind of modern day data science and things. Yeah. So then I finished my master's degree and I got a job in a bioinformatics core as a data analyst, and like I said, I kinda got, you know, farther and farther down the path where I became more and more interested in this idea of high performance computing, but essentially as it applies to bioinformatics.
[00:09:08] Unknown:
Speaking more to the specifics of the data needs in bioinformatics, you mentioned how the evolution of the equipment that they're using for being able to source the data and analyze it is one of the things that contributes to some of the challenges that arise. But what are some of the other aspects of bioinformatics and the ways that they collect and use data that make it unique in terms of some of the data engineering challenges that they face? I would say one really interesting aspect is that
[00:09:39] Unknown:
it's very natural data, and I would say, you know, one thing about that is that it does not lend itself well to a relational database, which is something that's been found. So over time, people have been continually trying to create these data models for biological data, and they were kind of trying to force it a bit into relational databases. And that was, you know, I mean, it was what we had and we used it and we all just kind of moved on with our lives. You know, it wasn't intuitive for pretty much anybody involved, anybody writing the data models or using the data models. And then I think over time, people started to get more into these kind of document databases.
And then one of the more interesting things is that the community as a whole has really kind of grasped onto this idea of using graph databases, you know, like, especially the one that was Neo4j. I think that's the one that Facebook kind of architected and open sourced. And that is much more, I suppose, natural and expressive for biological data because also biological data tends to be, you know, like a network of graphs as well. As far as the
[00:10:39] Unknown:
challenges that you have been working through and particularly in your role as a consultant working with different teams who are trying to bootstrap their operations and their technology stack for building these bioinformatics projects or, you know, working in the life sciences, what are some of the common problems that you find yourself solving on a semi regular basis?
[00:10:59] Unknown:
So I would say the most common problem is being able to boot up an infrastructure that's actually scalable, that's usable, that doesn't involve the data scientists or researchers spending, you know, entirely too much time kind of clicking around in the AWS console and trying to figure things out. So now one of my primary tasks is that I go and I work with startups as they're going through kind of this evolution of creating their startup, which has been really, really interesting to be able to see these different companies kind of grow. And in particular, I'm working with a lot of companies that are in this space called personalized genomics or personalized medicine.
And the idea behind that is say, you know, I'm in sort of an unfortunate position of having cancer. These days, I would go to a hospital, they would immediately test, you know, my genetic data, metabolomic data, just, you know, like, any omic data that they could possibly get from me. They would send it off to potentially one of these companies that I'm even working with. And what these companies are doing is that, generally, they are researchers, and they have some very, very specific knowledge on, say, a specific biological pathway or family of genes that is very involved in, say, a particular type of cancer, like maybe, you know, like, leukemia or something like that. And they're able to build, you know, like models and knowledge bases and hone all this information from, you know, data that's previously been collected and studied, and then also your own genetic data. And with that, they're able to combine these datasets and actually predict how well you might react to certain types of medication. So, you know, that can really inform your doctor's choices on maybe which type of chemotherapy you do or, you know, how well you're gonna respond to maybe certain types of radiation or things. So that's been very interesting. And then what happens there is that these data science companies, you know, like I said, they tend to be founded by researchers who have a very specific knowledge set in some very, very niche area. And so initially when they go out, some of them are bootstrapped and some of them get funding, but it tends to start very much in, like, a research and analysis space where sometimes they're even collecting their own data or sometimes they're just going and mining for data that's publicly available. You know, particularly in cancer, there are many, many, many data sets that are already, you know, like, freely and publicly available because that's an area of public research.
And then they go and they start doing their own research, you know, like, around this particular topic that they're trying to study. And they very quickly, even with just their initial research, tend to hit a wall with the amount of kinda, you know, DevOps or data engineering that they're really able to manage on their own, particularly because that's not where their skill set is. That's not where, you know, even if you just wanted to talk about where the money is in terms of their startup, their startup is making money by developing, say, a set of intellectual property around a particular type of treatment or a particular type of cancer. So there's no real reason for them to be doing that. And then they'll hire me or somebody like me to come in and start to deploy a much more scalable auto scaling infrastructure for them. And, generally, on kind of the first phase, that tends to start off with a high performance compute cluster.
And normally, I build that on AWS using a tool called AWS ParallelCluster, which is actually supported by AWS itself. It's a really great tool, you know, so kind of intuitive to go in there and bootstrap. Once the cluster itself is set up, it pretty much just requires kind of basic Linux administration skills to do that. And once you have an HPC scheduler set up, you know, you can really start to go to town with deploying these models and doing kind of grid search parameters on your different data sets and really going in there and mining the data. So then normally kind of the startups are on this first step where they're still doing their analysis, they're still doing their research, but at some point they get, you know, they're like, okay, we have something. This is it. And then they have to kind of be able to show sort of the research and their results and kind of show it to the world. Sometimes, you know, they're showing it to hospitals. Sometimes they're showing it to kind of VC funding agencies. Quite often, they're showing it to pharmaceutical companies and presenting it as sort of, like, okay, we have this kind of intellectual property. And the requirements for how these datasets are presented have gotten higher and higher, which is also a very interesting point to me because it requires, you know, these very scalable and very interactive data visualizations that we just didn't used to have. Like, when I first started on HPC, it was the scientists would come to me with their analysis, and I would make, like, a PDF in R Markdown with, like, their different parameters. And maybe that would take, like, a week to run, and I would hand it back to them. So you can imagine this kind of process would take, you know, weeks, months. I mean, you know, there are projects that went on for years where people were, you know, constantly fine tuning, you know, their results and the visualizations.
And that is no more. You know, the companies and things, they really expect to be able to go to a web browser and really, you know, sit there and click on this and click on this and, you know, move little sliders around and things like that. And then we get back to this problem of the datasets are so big. And, you know, until recently, most modern web applications were simply not made to be able to scale to this amount of data and read it into, you know, a very interactive visualization. It was something like, you know, you would throw it on the HPC and hope that you could create, like, a nice plot out of it. Now, you know, so many companies have kind of, you know, risen to that challenge. We have some really interesting libraries out there. We have, what is it called, Datashader, I think, which is a really nice plotting library that allows you to do these things. You know, we have R Shiny and Dash, which have really kind of catered themselves to the data science ecosystem.
We have, you know, a lot more file types that allow us to lazily load data. I think my favorite one lately has been using the Apache Parquet file type and using that with a Dask cluster to kind of, on demand, read in your data, analyze it, crunch your numbers. And then once you have that, you can deliver it back to the browser and you can do that in a very, you know, interactive fashion. I mean, really no matter how big your dataset is, because even behind the scenes, your compute cluster is auto scaling at the same time. You know? So, again, we have the kind of the analysis, the high performance compute infrastructure. And then on top of that, we have this kind of, just sort of, what I throw in a bucket of, like, application development to showcase some of your intellectual property or some of the research that you've done. And even that needs to be auto scaling now. You know, it's not enough to throw a PDF at the pharmaceutical companies and hope they're gonna give you money. Those days are over. So I also find that to be, like, a really interesting area of, I suppose, you know, research within DevOps or within data engineering: being able to go through and architect the back end for these very highly available, very scalable, you know, data visualization applications.
And that involves everything from, you know, the formats that you use to actually store your data. Are you using, you know, like a CSV file that has a lot of overhead, particularly if it's, you know, like a very large dataset? Are you using a Parquet or an HDF5? Are you using something, you know, like Dask or Apache Spark to load your data lazily and be able to spread the computation out? You know, these are all questions you have to start to ask yourself as you're kind of progressing with your data management and analysis and research needs.
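To make the Parquet-plus-Dask pattern described above concrete, here is a minimal sketch; the file path, column names, and cluster sizing are illustrative assumptions rather than details from any real project.

```python
# Sketch: lazily load a large Parquet dataset with Dask and compute a small
# summary that an interactive app (Dash, R Shiny, etc.) could render.
import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=4)  # local cluster; point at a remote scheduler for real workloads

# Dask reads the Parquet metadata up front and defers loading the actual row groups.
df = dd.read_parquet(
    "results/*.parquet",  # could just as easily be an s3:// path
    columns=["sample_id", "gene", "expression"],
)

# Operations build a task graph; nothing is read until .compute() is called.
summary = (
    df[df.expression > 10]
    .groupby("gene")["expression"]
    .mean()
    .compute()
)
print(summary.head())
```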
[00:18:19] Unknown:
And an interesting aspect of the current state of the industry is that there do seem to be a lot of new companies coming out in the life sciences space because of some of these capabilities of more available compute, better data analysis tools. And I'm wondering what are some of the sort of overall impacts and trends that you see as being contributing factors and potentially some of the technologies that you anticipate seeing more investment in?
[00:18:54] Unknown:
freelance 2 years ago, but I do feel like you're seeing a lot more, like, freelance consultants and freelance research scientists who are like, hey, I can go found my own startup by throwing, like, a couple $100 at AWS. Even, you know, like, a lot of the cloud platforms like AWS and Google, they will give you some startup money as well.
[00:19:17] Unknown:
I got, like, $10 for having a really sketchy looking website. So, you know, you too can go do that. So I'm definitely seeing a lot more of that, which I think is really interesting because one of the blessings and curses, I suppose, of life sciences is that, you know, the data gets so deep. We talk about, like, bioinformatics as if it's a thing. It's not. Like, within bioinformatics, you have so many particular types of research going on. You know, even with genomics, you have clinical genomics and statistical genomics. And then within clinical genomics, you have different people studying, you know, like, RNAs and coding and noncoding. And so, you know, you just get so deep and so niched so fast that it's impossible for, you know, any one person to know everything. You know, you have these scientists who have this, you know, extremely niche and extremely specialized knowledge and, you know, they could very well be the most specialized person in the world on this particular topic. They might go to conferences with like 10 other people or something like that.
You know, I think that's becoming more and more common. And I think because you have these scientists and they have this kind of knowledge that maybe very few other people have on this one particular area of study, it's becoming more and more important to actually be able to go and talk to those people, which has always been another very interesting problem that I've been involved in in data science too: kind of bridging the gap between the scientists and the IT support staff, I suppose, in particular, and getting them all to talk to each other and understand one another's problems. So I think in particular, you know, like the more tools that we can put in the hands of scientists, the more interesting research that's going to be done. And so I think that's why there's this really big push lately around usability.
I think in particular, you know, like the data science community, especially the PyData community, has done such a good job of this. You know, we have NumPy so I don't have to remember matrix algebra. We have, you know, Dask so I don't have to write MPI, and pandas so that nobody has to parse CSV files anymore. And I would say it's really fair to say that, you know, scientists can use those without being software engineers. And I think the next hurdle is going to be getting that also on the DevOps side, on, you know, the computational infrastructure side, on getting scientists to, you know, spin up their own high performance computing clusters and batch environments and, you know, Airflow production models and all this kind of thing. Another interesting aspect of the life sciences and bioinformatics
[00:21:37] Unknown:
sector is that it seems that the practitioners in that area are much more interested in and willing to do a lot of the data preparation versus more of the kind of, I guess, standard fare, if you will, data scientists or data analysts who want all of their data to be precleaned for them so that they can start working on building the model. And I'm wondering if that is something that you have seen in your work and how that influences the types of technologies that you're bringing in to do some of the data engineering aspects of the information that is being analyzed in the life sciences?
[00:22:15] Unknown:
Oh, yeah. Absolutely. I would say overall, you know, bioinformaticians or clinicians or genomicists or kinda whatever title they give themselves in that space really take a lot of ownership of their data and they tend to, like, really care about it, I've found. Biology is not always the most lucrative field, so, you know, you tend to have a lot of people who are in there because they really, really care about it. You know, they want to get into their own data. They want to explore it. They enjoy, you know, the research aspect of it. And so, yeah, again, I mean, I found just the more tools you can put in their hands that they can use, the better. They're always very appreciative of it. You know, they always wanna use these tools. For example, I was on a project recently where they were analyzing some data and it was essentially tabular data. So I wound up throwing up an Apache Superset, you know, web application, and Superset is, I suppose, a bit like a web-based IDE for databases and tables, and it can also make you, like, a lot of very nice plots. But essentially, it's a great tool to be able to hand to data scientists and say, here, if we have your data in a database or even in a CSV file, you can analyze your own data. And so I was talking to a variant scientist, somebody who studies genomics data, and, like, he was so excited about this. He went and he learned at least some SQL so that he could go through and analyze his own data. And then he was like, okay. I have the perfect filter for this, you know, extremely rare genetic variant that I'm studying. And so then he gave me, like, all the SQL, you know, commands, and then I was able to go back through and write a script, you know, to apply the knowledge that really he had created. I just kinda handed him the tool, and he went through and found all this, and then we were able to work together and develop, you know, like a filtration pipeline for the data that they were working on. So, yeah, the more tools that you can hand to the scientists, the better. You know, the more research they're gonna find, the more interesting results, everything.
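A rough, hypothetical version of the kind of variant filter a scientist might prototype in Superset with SQL and then hand back to be scripted; the column names and thresholds here are made up for illustration, not taken from the project described above.

```python
# Sketch: filter a table of variant calls down to rare, well-supported,
# clinically flagged variants in a handful of genes of interest.
import pandas as pd

variants = pd.read_csv("variants.csv")  # e.g. a table exported from annotated VCFs

rare_pathogenic = variants[
    (variants["allele_freq"] < 0.001)                     # very rare in the population
    & (variants["depth"] >= 20)                           # well supported by sequencing reads
    & (variants["clinvar_significance"] == "pathogenic")  # flagged in clinical annotation
    & (variants["gene"].isin(["BRCA1", "BRCA2"]))         # genes of interest
]

rare_pathogenic.to_csv("filtered_variants.csv", index=False)
```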
[00:24:02] Unknown:
As you're building out some of the data engineering stacks and the technology tool chains for these researchers and scientists and the people who are working with the data to build out the models and the analysis, what are some of the attributes that you're generally trying to optimize for? Is it things like speed of analysis or being able to churn through the data? Is it things like the user experience to try and push down some of the distributed systems concepts or anything like that? Is it the scale of being able to process the data
[00:24:33] Unknown:
at certain volumes? Correctness, because you want to ensure that there's no potential for error being introduced because of, you know, some of the quirks of distributed systems or just anything along those general lines? I would say first, we tend to go for process and speed. I do talk to quite a few scientists who are, like, right off the bat, they're, like, listen. I wouldn't even be talking to you because 10 years ago, I could do this analysis on the lab computer in Excel and it was fine, but now my datasets are too big. And so now, you know, like, I'm talking to you to come fix that for me. And so then, you know, the very first thing that we do is take their analysis off the lab computer and go, you know, package it into a Docker container and go use it. So I would say, yes, like speed and performance tends to be, you know, the first box that we want to check anyways. And then once we have that, yeah, then absolutely. I move on to usability. I'll try to get in there with, you know, some parallel computing options.
It really depends too on exactly how much they wanna know as well. Some of the scientists, you know, are, like, they're fine with just saying, hey. I developed the analysis and I want for you to apply it, and it can just be this kind of black box that sits over here and I don't care. Whereas some people, you know, they really kind of wanna learn those skills for themselves, or maybe they're building up a team and they want to have those skill sets on their team. Yeah, it does tend to depend a bit on what people want, but I would say performance and time always first and kind of get the data out there, you know, because that's what brings in the funding and the money ultimately so that they can keep going. In terms of the actual sourcing of the data,
[00:26:06] Unknown:
how much of that process do you work to try and integrate with and automate into some of the challenges of doing the data collection and data integration so that it's at a point where the analysis and the overall data preparation can be done? I would say quite a bit, although I'm pretty specialized in two main areas, which are clinical genomics and high content screening. So clinical genomics is, like I said, genomics data. If you've ever
[00:26:30] Unknown:
done 23andMe or maybe taken any kind of, you know, genetic test for health, family issues, or anything like that, you're kind of familiar with what genomics data looks like. If you aren't, doing 23andMe is actually pretty cool. They give you, like, a ton of data on, you know, where your ancestors are from and maybe some health things and things like that. So that can be kind of fun if you ever wanna go do that. So that's the genomics data that comes off of the sequencer and the pipelines and the tool sets for that type of data tend to be very, very well defined. So normally, it's a pretty simple kind of data pipeline. It's like, okay, data comes off the sequencer.
Typically there's a predefined pipeline for doing kind of the raw processing of that data and getting it to at least, you know, like a QC kind of stage. And the other type of data that I work with is called high content screening, which is images from a microscope. And these days, they are robotic microscopes because, of course, they are. And so that tends to be something very similar. The microscope, you know, takes its pictures, spits those out onto a file system. And then once they appear on the file system, generally, you wanna, you know, you wanna push them along to be analyzed. So they're both very similar. You know, you get data, it appears on a file system, and then you wanna take that data and generally analyze it, you know, using some kind of HPC cluster.
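A toy sketch of that handoff, where new instrument output lands on a shared file system and gets submitted to an HPC scheduler; the directory, polling interval, and the process_sample.sh job script are all assumptions for illustration, not part of any pipeline described here.

```python
# Sketch: watch a directory for new sequencer output and submit one Slurm job per file.
import subprocess
import time
from pathlib import Path

WATCH_DIR = Path("/data/instrument_output")
seen = set()

while True:
    for fastq in WATCH_DIR.glob("*.fastq.gz"):
        if fastq not in seen:
            seen.add(fastq)
            # process_sample.sh is a hypothetical Slurm batch script.
            subprocess.run(["sbatch", "process_sample.sh", str(fastq)], check=True)
    time.sleep(60)  # simple polling; a real setup might use inotify or S3 event notifications
```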
[00:27:46] Unknown:
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end to end, fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing the time to detection and resolution from weeks or days to just minutes.
Start trusting your data with Monte Carlo today. Visit dataengineeringpodcast.com/impact today to save your spot at Impact, the Data Observability Summit, a half day virtual event featuring the first US chief data scientist, the founder of the data mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 people who RSVP will be entered to win an Oculus Quest 2. In terms of the data formats, you mentioned that for the microscopes, it's largely going to be image files. And then, you know, a number of people are probably familiar with the fact that genomics data is largely going to be strings of text and needing to be able to chunk that at appropriate intervals. But what are some of the other types of data formats that you're likely to interact with? I know you mentioned HDF5 earlier, which is, I know, something very popular in the scientific community, but not something that most people who are dealing with, you know, typical business analytics are likely to run into. I'm wondering if there are any other interesting or quirky formats that you encounter.
[00:29:29] Unknown:
Oh, there are. It used to be a thing in bioinformatics where I think everybody had a hobby of creating new file formats, and so there used to just be a ton of them floating around. I think we've standardized at least a little bit. In genomics, you tend to have, you know, one kind of set of files, so you have your raw reads that come off the sequencer. These are just kinda straight DNA, like you're saying. They're very much strings, but since it's large data, we kind of encode it. And then because it comes from, like, an actual physical machine, it also gets encoded with, you know, like, quality data as well. Like, how much kind of confidence does the machine have that, like, okay, I'm reading, you know, an A here and a T here and a C here and so on and so forth.
So we have those files. Those are called FASTQ files, and those are the raw reads. Those are, like, the raw DNA before any processing is done. And then after that, what we do is we take the raw reads, and we have to align those back to a reference genome, which is essentially a very long string comparison problem. Like, if you've ever studied any of the algorithms where you do string comparisons, you have a string here and a string here, and then you try to line them up and see where they are most similar. So that would be a BAM file. And then in genomics data, we want to get all the way down the pipeline to this type of file that's called a variant call format.
And a variant call is essentially where you say, you know, let's take humans, for example. All humans are about, like, 98% similar to one another, but then in different parts of the genome, we'll have, you know, a particular set of DNA that's different. And you can take those and you can do all kinds of comparisons. You can compare people within a population. So you could see, like, Africans versus Western Europeans versus Americans versus South Americans and kind of cluster people together based on their ethnicity. You can do, like, different disease studies.
There are quite a lot of diseases that are caused by just one change in the DNA, just one. So I think one of the first ones that was found was cystic fibrosis, and there are actually several changes in the DNA that can contribute to that. So that's another file format that we have, VCF, and that's probably the most widely used one in genomics. And what that does is describe, as opposed to the entire genome, just, okay, here are the places where it's different. You know? So if we're saying 99% of the human population is the same in, you know, a certain position and then this individual is different right here, that's something important that we would wanna know about. And then once you have the variant call file formats, then, of course, you have all kinds of annotation databases. I would say there's not so much of a standardized format for those yet. I think some people use kinda typical databases.
There are all kinds of interesting annotation formats out there, but these are to really say, like, okay. You know, it's been found in clinical studies that this difference in this DNA here contributes, you know, to this type of disease, that kind of annotation.
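To make the FASTQ description above a little more concrete, here is a minimal reader; the format itself (four lines per read, with Phred+33-encoded quality characters) is standard, but the file name and what you do with the records are placeholders.

```python
# Sketch: iterate over a FASTQ file. Each read is four lines: an @header,
# the bases, a '+' separator, and one quality character per base.
def read_fastq(path):
    with open(path) as handle:
        while True:
            header = handle.readline().strip()
            if not header:
                break
            seq = handle.readline().strip()
            handle.readline()                     # the '+' separator line
            qual = handle.readline().strip()
            scores = [ord(c) - 33 for c in qual]  # Phred+33: the sequencer's per-base confidence
            yield header.lstrip("@"), seq, scores

for name, seq, scores in read_fastq("reads.fastq"):
    print(name, seq[:20], scores[:20])
```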
[00:32:22] Unknown:
For a data engineer who's interested in working in bioinformatics or with life scientists, how much overall domain knowledge do you think is necessary to be able to help with building out some of these data pipelines or data analysis versus familiarity with the underlying processing technologies and distributed systems concepts?
[00:32:44] Unknown:
I would say that you could start off on very little. Like, you know, I think over the years, I've worked with plenty of people who, you know, they come in maybe because there's an opportunity there or because they've moved someplace that just has a lot of opportunity within the life sciences or, you know, there just happened to be a job posting, you know, and they happen to get it and they don't have any background in, you know, biology or even a biology related field, but just come in because that's where the opportunity arose, you know, and they can really get up and running pretty quickly. Particularly if you do have this kind of data engineering or parallel computing background, there tends to be a lot of call for that. There is so much more demand, I've found, than can really be met. So if you're a data engineer
[00:33:25] Unknown:
and you wanna go get into life sciences, you know, go for it. In terms of the overall data management aspect of it, how much of the sort of governance concerns end up coming into play as far as personally identifiable information and data privacy and data security aspects? And what are some of the ways that that manifests in the management of the data, and how much of it needs to be customized specifically to bioinformatics versus things that can be sort of applied from just the general space of data management and data governance?
[00:33:59] Unknown:
Yeah. This is one that really gets a lot of research groups. It's been the downfall of quite a few. Generally, when you're working with human data, it has to be anonymized and it generally has to go through a type of data compliance called HIPAA, which is, like, if you've ever been in a doctor's office and you've seen, like, don't talk about patients in the elevator or something like that, that's HIPAA compliance. You're not just supposed to talk about people's medical data. You know, everybody should have privacy. I think, like, hospitals, they call people guests now instead of patients or, you know, or something like that. So, yeah, I mean, again, it really depends on the type of data that you're working with. If you're working on, you know, maybe animals or sea life or coral or climatology or something like that, it might not be as much of an issue. But if you are working on any type of human or medical data, there's pretty much always gonna be some kind of HIPAA compliance, and you have to make sure that you are aware of that. Probably one of the biggest worries actually is not even making sure that the data is anonymous, but making sure you don't lose data, which means you have to have backups of data, which means you can't have, you know, data scientists just kind of being able to, like, run around in the raw data. You have to separate out your raw data and make sure that's read only. Make sure that is backed up several times. You're really making sure not to lose any data because that is another really huge issue. That actually came up recently during COVID. There was a company in the UK, and I don't wanna rag on this company because you just know there were, like, some poor data engineers sleeping under their desks for, like, a couple weeks before this happened, but they were collecting data using Excel. They were collecting COVID data using Excel, and Excel has, like, a limit on the number of rows that you can have on there. And apparently, they had reached that limit. And they didn't realize this because, you know, again, I think that the poor people were being very overworked probably and sleeping under their desks. But the end result of that is that they lost a lot of data, and it was raw data. It's not data that can be generated again because it was being collected, like, in real time. It was also supposed to be used for contact tracing for COVID within the UK. You know, so they lost quite a bit of data. It couldn't be generated again. And then one of the other issues that came about with this was, you know, the issues around data compliance. You can't just lose medical data. It is actually part of people's privacy.
Part of patients' privacy, or part of patient data anyways, is having a complete picture of the history of the patient. And if you lose, like, a big chunk of data in the middle, then you can't really say that you're making informed decisions about, you know, anything, but particularly about their treatment. Or in this case, it was contact tracing. So if you're missing a whole bunch of contact data, you can't really be making informed decisions about the, you know, spreadability of COVID in this case. Yes.
[00:36:37] Unknown:
Every engineer has their horror story. I'm glad that that one is not mine.
[00:36:41] Unknown:
Oh, I know. I'm so glad. I feel so bad for that team though because, I don't know, you just know something happened where, like, somebody at some point was, like, maybe we should be using a database instead of Excel, with, you know, sessions and commits and, you know, counts and these things we might wanna have. And then, you know, somebody else was like, no, we need to get this out the door now. You know? And, like, you know, I'm just sure something like that happened. There's always the aspect of, we don't have time to do it right. We just need to get it done.
[00:37:11] Unknown:
And it usually ends up turning out wrong.
[00:37:14] Unknown:
Which is another aspect that we all have. You know, we can all have these high and lofty ideas about how we should be doing things, but then at the end of the day, you know, maybe you only have a certain amount of money to back up your data. And, you know, once that's gone, that's gone. So then what are you gonna do? What if you can't afford to keep backups of all your raw data or, you know, like, maybe each step in the pipeline and things like this? How are you gonna make those kind of decisions and trade offs? Absolutely.
[00:37:37] Unknown:
And also in terms of the kind of durability and backups of the data for the customers that you're working with, I'm wondering if they have also been taking advantage of some of the recent projects that have been coming in to handle things like data versioning with either DVC or LakeFS and technologies along those lines for being able to have some of the kind of Git-like semantics of branch and merge for the different datasets to explore different potentialities?
[00:38:05] Unknown:
I haven't heard about that too much in bioinformatics. It tends to, like, take us a bit to adopt the kinda cool new tools. It seems like it tends to sort of go through, like, the physics and the climatology folks first, and then, eventually, it comes down to us. And I think some of that is due to the fact that, you know, biologists are all doing the best that they can, but for whatever reason, I think biology is the last holdout in the STEM kind of fields and in the sciences where you can get a degree in biology, at least in the US, and you don't ever have to take a programming course of, like, any sort, whereas that isn't true in any of the other sciences. If you study physics or neuroscience or chemistry, for example, you know, you do take at least, like, some level of programming.
So I found there is a bit of a hurdle for people to get over for that. I know, like, it's not quite as cool as some of these newer technologies, but people have been kinda taking advantage of object-based storage and using versioning, for example, in S3 buckets and then, you know, also replicating their buckets and things like that. I suppose that's a very common practice.
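A minimal sketch of that practice, keeping raw data versioned in S3 so it can't be silently overwritten; the bucket name is hypothetical, and replication and a deny-delete bucket policy would be configured along the same lines (or in infrastructure-as-code rather than a script).

```python
# Sketch: turn on S3 object versioning for a raw-data bucket with boto3.
# With versioning enabled, overwrites and deletes create new versions or
# delete markers instead of destroying the original objects.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_versioning(
    Bucket="example-raw-sequencing-data",  # hypothetical bucket name
    VersioningConfiguration={"Status": "Enabled"},
)
```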
[00:39:04] Unknown:
In terms of your work in bioinformatics, both in academia and now working freelance with different customers, what are some of the most interesting or innovative or unexpected solutions that you've seen for being able to manipulate and analyze bioinformatics data, particularly from a data engineering perspective?
[00:39:24] Unknown:
I am a really huge fan of Dask. Like, I just, you know, I talk about them all the time, and I love them. And I find it is so great because, you know, it is really, like, one of the best tools that I feel like I can put in the hands of scientists that they can really go through and start to optimize, you know, their analyses themselves. So, for example, one of the kind of projects that I used to work on a lot was that, you know, these scientists would have this data, and then either they would develop an analysis, and when they developed it, they would develop on a subset of their data because, you know, they don't wanna develop it on the entire dataset. It's too big. It will take forever to develop, so on and so forth. Or they would have an existing analysis, and then they would get a new sequencer and the resolution of the data would, you know, just exponentially curve upwards. And then they would bring in somebody like me, and I would kind of, you know, go, like, bang around on whatever kind of software program it was that they were using and, you know, get out my profiler. I still think Perl just has the nicest profiler that I've ever seen. I still don't like the Python one quite so much. But, you know, essentially go through and be like, where can we get the biggest kind of bang for our buck in terms of optimizing this analysis, maybe, you know, spreading some things out either with an in-code kind of solution like MPI or spreading it out within the HPC scheduler itself. What is the one thing that we can do that will be the easiest that will make the most impact?
And ever since Dask came out, I just immediately moved to that. It is so much easier for scientists than trying to learn MPI or trying to, you know, or even trying to understand maybe the HPC scheduler itself. If instead of trying to understand the HPC scheduler, they can just go to the Dask configuration options, and they can say, you know, give me a minimum of one kind of Dask worker and a maximum of, like, 20, and Dask will auto scale that for you based on the demand. It does quite a good job with that. They can do things like parallelize for loops, which, you know, people like to make fun of me a bit for, like, oh, you're going through and parallelizing for loops, but that has actually saved quite a lot of projects that I've been working on. You know, like, when we went through and did the profiler, oh, this one for loop is what's taking the most amount of time, and it's not that complicated. We can just, you know, parallelize it, and everything will be fine. You know? And now being able to go and teach scientists how to do that with their own code, like, here's Dask. Throw it in here, and then you'll actually be able to do your analysis in a day instead of a week, let's say. That makes a huge impact.
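A rough sketch of the two Dask patterns mentioned above, adaptive scaling between a minimum and maximum number of workers and parallelizing an existing for loop with dask.delayed; the cluster type and the process_sample function are placeholders (on an HPC scheduler you might use dask-jobqueue instead of a local cluster).

```python
# Sketch: adaptive Dask workers plus a parallelized for loop.
import dask
from dask.distributed import LocalCluster, Client

cluster = LocalCluster()              # stand-in; a SLURMCluster from dask-jobqueue is common on HPC
cluster.adapt(minimum=1, maximum=20)  # scale workers up and down with demand
client = Client(cluster)

def process_sample(sample_id):
    # stand-in for an expensive per-sample computation
    return sample_id * 2

samples = range(100)
# The original serial loop would be: results = [process_sample(s) for s in samples]
delayed_results = [dask.delayed(process_sample)(s) for s in samples]
results = dask.compute(*delayed_results)
print(results[:5])
```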
[00:41:49] Unknown:
In your own experience of working in bioinformatics and helping to build out some of the technology stacks for it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:42:00] Unknown:
I would say sometimes it's very difficult to understand the data itself. You know, so for example, sometimes you're working on these really niche scientific problems and especially this has been the case for me when I've tried to sort of get involved in other kind of branches of science. Like, I started off mostly being involved in clinical genomics, and then I wound up also getting involved in this other area of science called high content screening. And initially, I had a very difficult time understanding, you know, like, the data, what it was about, how the experiments were set up, what was even the point of setting them up like this. And so I was making some decisions that were, you know, not exactly optimal because I wasn't understanding it well enough and just going through that process, you know, definitely taught me quite a lot in terms of, you know, okay, slow down, like, really make sure that you're understanding everything that's happening, go talk to the scientists, read the papers, do everything that really needs to be done or else you're gonna be making, you know, these kind of goofy decisions.
[00:42:53] Unknown:
One thing that seems like it is a sort of potential outcome of bioinformatics and life sciences is that with, you know, business analytics, a lot of the times, what you're trying to optimize for is the repeatability of a particular processing pipeline because you're always going to be wanting to answer the same set of questions and just seeing how those evolve over time. Whereas if you're in a research context, it seems that there may be cases where you've established a desired outcome and you wanna be able to run a set of data through it to determine what is the best course of treatment, for instance. But if you're in more of a research exploration loop, the flexibility of the processing is paramount. And I'm wondering what you see as being the sort of dominant method of working with the data. If it's more just I need to be able to do all of this exploration iteratively, and I need the flexibility in the pipeline, or I need to be able to do this repeatedly, and so I need to be able to make sure that there is consistency in how all of these processes are being executed.
[00:43:54] Unknown:
Initially, I would say it's very much an exploration, and, you know, everybody is just kinda figuring it out as they go along. That tends to be the name of the game in research. You know, you can't really throw money into research necessarily expecting that you're gonna get something out right away. You know, it's just the way that it is. So I think everybody kinda goes into it with at least somewhat of that expectation. And then as time goes on, yeah, then you need to kind of shift focus from, okay, you know, the data exploration to really setting up your reproducible pipelines, which is always very, very interesting for me to work with, you know, kind of startups that are sort of in that transition of, okay, we have an analysis and now we need to run it, like, 10,000 times or something and, you know, maybe we have SLAs, and we need auditing, and we need this, and we need that. And it's a very different mindset I found than doing the initial exploration.
It's also part of the research stack. So I always think of kind of the research stack as being like, you know, you have your DevOps on the data engineering side anyways. You have your kind of high performance compute cluster, and that's meant to be very much for the exploration and the analysis and maybe doing kind of grid search across all the different algorithms and, you know, things that you can do with that. And then, you know, you tend to get to some point where you're saying, okay, I have something here, and then you need to kind of clean that up and find a way to present it to other people, pharmaceutical companies, VCs, you know, people who are gonna give you money, whatever, something like that. And then sort of once you've reached that stage where you say, you know, or once the scientists have reached that stage, they'll say, okay. We have something here. And then all of a sudden, they need to shift focus from doing all of this very explorative kind of data analysis to, again, having this pipeline that needs to be really robust. They really need to, like, you know, track exactly what is happening at every step in the pipeline. Any data that comes in, you need to have very careful, you know, auditing and life cycle policies, all this kind of thing because, again, that's another part of the HIPAA compliance in particular.
You really need to make sure that you know every time the data is being touched: what's going on with it, how is it being processed, all this kind of thing. So that's a very different focus for the scientists, and then we tend to go through and start what I suppose is the next phase, where we're building Docker containers, setting up CI/CD, versioning their analysis, and versioning maybe even their machine learning models and their annotation and test data sets, all these kinds of things. Because biological data is very complex, often you're bringing in a lot of different data sets from a lot of different databases or annotation sets; you're just bringing in a lot of disparate data to do your analysis, and you have to make sure that you're tracking where everything is coming from.

To that point, I'm wondering how much utility you've gotten from some of the metadata management and data lineage projects that are starting to become more prevalent.
Not very much. It's still, for the most part, something that I've been doing pretty manually, which is kinda terrible to say on a podcast like this. But I guess that's because, when these analyses are being developed, maybe you get one of these new analyses every couple of years or something like that. And once you have it, you do not touch it for a long time. You do not touch it until there's a new version of the human genome out or something like that. It just stays there. And because there is so much riding on these particular analyses, you tend to really have to have a person in the loop who's very carefully going through and documenting each one of the data sources.
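To make that manual bookkeeping concrete, here is a hedged sketch of how the reference data behind one analysis might be pinned in a small manifest checked into the repository. The URLs, versions, and digests below are placeholders, not real reference data.

```python
# Hypothetical manifest: record, next to the code, which source, version, and
# checksum each reference input came from, so the documentation can be
# machine-checked instead of living only in someone's notes.
import hashlib
import json
from pathlib import Path

MANIFEST = {
    "reference_genome": {
        "source": "https://example.org/GRCh38.fa.gz",    # illustrative URL
        "version": "GRCh38",
        "sha256": "expected-digest-goes-here",
    },
    "clinvar": {
        "source": "https://example.org/clinvar.vcf.gz",  # illustrative URL
        "version": "2021-10",
        "sha256": "expected-digest-goes-here",
    },
}

def verify(name: str, local_path: Path) -> bool:
    """Compare a downloaded file against the digest recorded in the manifest."""
    digest = hashlib.sha256(local_path.read_bytes()).hexdigest()
    return digest == MANIFEST[name]["sha256"]

if __name__ == "__main__":
    # Write the manifest alongside the analysis so it is versioned with the code.
    Path("reference_manifest.json").write_text(json.dumps(MANIFEST, indent=2))
```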
But those do look interesting. I do need to go get into versioning data and all these interesting ways of tracking metadata and annotation data and things. I am seeing that in the bioinformatics space. There's a very interesting project, I think it's called the BioThings API. I think it's using Neo4j to create these graph based representations of biological data, and it's setting up these really nice, really robust, really well documented APIs for all of these different databases, so that you have a very consistent way of accessing so many different types of data. And then as a part of that, you do get this information that you were discussing, like what's the version of the ClinVar database that we're using, or the version of the human genome, or what ethnicity is this person being compared to? All these kind of important pieces of metadata that you need to know as you're going through with your analysis.
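For anyone curious, here is a minimal example of pulling source-version metadata from MyVariant.info, one of the public BioThings APIs. The exact shape of the response may differ from what's shown, so treat the key names as assumptions to check against the service's documentation.

```python
# Query the public metadata endpoint of MyVariant.info (a BioThings API) and
# list the upstream sources (ClinVar, dbSNP, ...) with the versions it reports.
import requests

resp = requests.get("https://myvariant.info/v1/metadata", timeout=30)
resp.raise_for_status()
metadata = resp.json()

# Field names are assumptions based on the typical BioThings metadata layout.
for source, info in metadata.get("src", {}).items():
    print(source, info.get("version", "unknown version"))
```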
[00:48:10] Unknown:
Another aspect you mentioned is the auditability of the overall process. I'm curious how that factors into some of the ways that machine learning is used in the space, given the potential, particularly for deep learning, to have some unknown motivation for a given outcome given a particular set of inputs.
[00:48:29] Unknown:
I would say that that tends to be less prevalent, I suppose, with biological data, because nobody trusts anything that just came from a computer in biology. It has to be backed up in the lab. If it's not backed up in the lab, as far as anybody's concerned, you don't have data, especially in, you know, a drug trial or any kind of medical data. Say somebody had some finding that came out of a machine learning model experiment, and they couldn't back that up in the lab. They wouldn't have a finding anymore. So, yeah, I just don't think that's quite so prevalent, because you do still need to back everything up.

It'll be interesting to see, then, how the outputs of the AlphaFold project end up percolating through the biology community, or if they do.

Yeah, we'll have to see. I think overall the biology community is a bit slower to change than maybe some of the other communities, especially compared to, like, the growth that I've seen in climatology, because I always keep my eye on that, I find it to be very interesting, or the computer vision folks. In particular, we get a lot of trickle down from computer vision in high content screening, because high content screening is images from a microscope, right? So then you can go through and, theoretically, analyze a lot of your images using some of the newer computer vision algorithms. That has caused quite a lot of discussion: well, but how do we know this is actually analyzing the thing that we think it's analyzing, or that it doesn't have, like, some other kind of bias? And the answer is, like,
[00:49:54] Unknown:
I don't know. Maybe. Go back it up in the lab, and we'll be fine.

As you continue to work in the field and work with some of these different companies, what are some of the overall trends in industry and academia, or some of the upcoming technologies, that you're particularly tracking for application in bioinformatics workflows?
[00:50:14] Unknown:
So I think, for right now, the area of research that I'm really, really interested in is this idea of usability, where I really wanna go and put high performance computing in the hands of researchers. I want them to be able to spin up and deploy and automate and do, you know, all the things for their own infrastructure, and really be able to take ownership of that. I don't think they should have to have somebody like me, or an IT department, or essentially anybody kinda standing in the way. And, again, I think the software engineering community has done a really good job with this with the different projects that have come out of the PyData and tidyverse ecosystems, and I would really like to see this happen also in the DevOps and data engineering communities.
So kind of my big project for the time being is that I'm building a SaaS application, and I'm really giving it a science-first approach. The idea is that I've found that so often when I go and talk to scientists, so many of them don't even care about the high performance compute system. They care about the science. So often what happens is that people will come to me and say something like, oh, I see that you have this article on doing high content screening for this particular dataset using this particular analysis package, and that's what I wanna do. They don't even care about the data science infrastructure or that I know how to deploy HPCs. What they care about is that I know how to do their particular type of data analysis in a way that scales. So I'd really like to keep focusing on that approach and create an application. It's tentatively called Bio Deploy.
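Purely as a toy illustration of that science-first framing (this is not how Bio Deploy works; its internals aren't described here), a sketch might let the researcher name the analysis while the tool maps it to an infrastructure recipe behind the scenes. All instance types, container images, and workflow names below are made up.

```python
# Toy sketch of "start from the science, hide the DevOps": the researcher asks
# for an analysis by name, and the tool picks the plumbing that backs it.
from dataclasses import dataclass

@dataclass
class InfraRecipe:
    instance_type: str
    container_image: str
    workflow: str

# Hypothetical mapping from a scientific question to infrastructure details.
RECIPES = {
    "single-cell-rnaseq": InfraRecipe("r5.4xlarge", "example/scanpy:latest", "scrnaseq.nf"),
    "high-content-screening": InfraRecipe("g4dn.xlarge", "example/cellprofiler:latest", "hcs.nf"),
}

def run_analysis(analysis: str, data_dir: str) -> None:
    """Pick the recipe for the requested science and (pretend to) launch it."""
    recipe = RECIPES[analysis]
    print(f"Launching {recipe.workflow} on {recipe.instance_type} "
          f"with image {recipe.container_image}, data in {data_dir}")

run_analysis("single-cell-rnaseq", "s3://example-bucket/project-42/")
```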
I had another name and that one was already taken, so I'm a little bit hesitant to name anything right now. But, yeah, essentially, really concentrate on the kind of problems that the scientists actually face. So instead of saying to the scientists, hey, you wanna deploy a high performance compute cluster? Say, hey, you wanna analyze your single cell or genomics data? Start from the science and then work our way down in a way that abstracts the nuts and bolts of the actual DevOps, because, again, I feel like the software engineering community has already done a very good job with that in terms of the PyData projects, and I think we can get to that point in the DevOps and data engineering community as well. So that's kind of my focus for the next little bit. All the individual recipes and things are gonna be completely open source. I think maybe the application itself will be open source too. I just haven't figured out exactly how I'm gonna support that, because there is one me and there are a lot of biotech startups, and this is becoming a problem for me. But, so far, everything is gonna be open sourced. In the next couple of months at least, I'm really gonna sit down and focus on, okay, how do I take this and make it as usable as possible for the scientists? Because one of the other things I always run around saying is that, yeah, we have tech problems, we certainly have tech problems, but I think we have considerably more people problems: getting the people from the different disciplines to really sit down and understand one another. So, for example, getting the IT folks to understand the data scientists, or the software engineers to study the biologists, all this kind of thing. So I think that's kind of, you know, really the area that I'm most focused
[00:53:29] Unknown:
on for the next, I don't know, for the time being. We'll see how long it goes. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:53:48] Unknown:
Oh, I would suppose being able to get really good version control on very, very large datasets without it being incredibly cost prohibitive. You know, I'm gonna go with that. Yeah. Definitely.
[00:53:59] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you've been doing in the bioinformatics space. It's definitely a very interesting problem domain and one that I've been fortunate enough to speak with a few different folks about. So thank you for all the time and effort you've put into that, and I hope you enjoy the rest of your day. Alright. Thank you. You too. Thanks for having me on the show. I really enjoyed it. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
- Introduction and Sponsor Messages
- Interview with Jillian Rowe: Data Engineering in Bioinformatics
- Jillian's Journey into Data Management and Bioinformatics
- Challenges in Bioinformatics Data Management
- Trends and Technologies in Bioinformatics
- Data Sourcing and Integration in Bioinformatics
- Data Governance and Compliance in Bioinformatics
- Innovative Solutions in Bioinformatics Data Engineering
- Lessons Learned in Bioinformatics Data Engineering
- Future Trends and Technologies in Bioinformatics
- Closing Remarks and Contact Information