Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto

Hello, and welcome to the data engineering podcast, the show about modern data management.

When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, NodeBalancers, and a 40 gigabit network, all controlled by a brand new API, you'll get everything you need to run a bulletproof data platform.

Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.

And are you struggling to keep up with customer requests and letting errors slip into production?

Wanna try some of the innovative ideas in this podcast but don't have time?

DataKitchen's DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and datasets while improving quality.

Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love.

Join the DataOps movement today and sign up for the newsletter at datakitchen.io.

After that, learn more about why you should be doing DataOps by listening to the head chef in the data kitchen at dataengineeringpodcast.com/datakitchen.

And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

Your host is Tobias Macey, and today I'm interviewing Kevin Doran and Andy LoPresto about Apache NiFi. So, Kevin, could you start by introducing yourself?

Sure. My name is Kevin Doran. I'm a committer on the Apache NiFi project, which just means that I submit contributions to it and help review other people's contributions. I've been doing that for about a year. I'm also a software engineer at Hortonworks, which is one of the vendors for Apache NiFi. They've got a product called Hortonworks DataFlow, which is basically NiFi that comes with enterprise support. At Hortonworks I'm on the data flow management team, which means I contribute to projects like Apache NiFi. It's always nice being able to get paid to do open source.

And, Andy, how about yourself?

Hi. I'm Andy LoPresto.

Like Kevin, I am a committer on the Apache NiFi project. I'm also a member of the PMC, which is the project management committee. Members of the PMC make decisions about the future of the project. We can cast binding votes on releases and perform a couple of other project tasks.

I'm a senior member of technical staff at Hortonworks. So, like Kevin, I get paid to work on open source software. And I'm on the core engineering team, so I get to work on NiFi every day. I focus on information security, and that includes cryptographic services, data protection, threat modeling, and vulnerability assessment and mitigation. And in the past, I worked for Apple and some healthcare startups.

And, Kevin, again, do you remember how you first got involved in the area of data management?

Kind of indirectly, without really knowing what I was doing. Out of college, I was doing research in the areas of networking protocols and cybersecurity, so not directly data analytics. But we were working with a lot of data, and it was in formats like CSV, XML, and JSON. So just out of necessity, we would write scripts to deal with these large datasets, things like Perl and later Python to manipulate the data, just sort of scripting some jobs around structured data to process it. And that works for a while. But over time, those research projects keep going and new requirements get added to them. You're looking into other areas. You have new datasets. So it starts as a couple of simple scripts, and they grow. You get new requirements, like how do you process and deal with binary data? How do you do continuous processing or batch jobs, when data we were collecting would get input into the system or other partners would send us big datasets that we'd have to process? So while we're doing research and we've got the domain we're focusing on, there's also this huge data problem and data management problem that we have to deal with. It turns out that was taking up as much of the time as the actual research domain work. So that was a good introduction to, I guess, the wrong way to do data management, because we made a lot of mistakes. But it was a good learning experience.

A little bit later in my career, I was working at an optimization platform for online advertising. It was the first time I was exposed to what we'd call internet-scale data: billions of records per day, continuously flowing, being fed to various processing chains, feeding algorithms, feeding models, data coming in from third parties to enrich our data, our data going out, packaged and delivered to partners. So constant, everyday data management problems. That was a really good learning experience for some more advanced data management frameworks, data storage tools, data archiving policies, and connecting decoupled systems that need to share data and interact with each other. And that's where I really started to understand some of the unique aspects of data engineering and data management.

And, Andy, do you remember how you first got involved in the area of data management?

Yeah. I've actually worked on data management for most of my career, which I did not intend to do either.

I started by writing internal management systems for a startup when I was an intern in high school. That was things like an asset management database, a document management database, and an applicant tracking system, just because they were small and didn't have anybody else to do that. And from there, I went on to work on workflow with Documentum. These were workflow and life cycle software tools.

And then I started building out applications and APIs that were doing metrics tracking and analysis.

And finally, when I went to Apple, I was working on information protection across the worldwide retail stores. So throughout the world, all the financial data and tracking information and customer information. It was really sensitive stuff.

And, like Kevin said, that was kind of my first time seeing that kind of data on that scale. So that was my introduction to data management.

And as I mentioned at the opening, we're here to focus on the work that you're both doing with the NiFi project at the Apache Software Foundation.

So can you start by explaining a bit about what NiFi is for anybody who might not have heard about it yet?

Andy, you wanna take this one first?

Yeah. Thanks, Kev. NiFi is a dataflow tool, and that is a very simple sentence, but dataflow means different things to a lot of different people. The analogy that we like to use is to consider data like water. NiFi is not here to help you build these complicated sewer pipes, where you design it, then construct it, then test it, and it's a giant system, and then you flip a switch and turn the data on. Data is always flowing; it's always there. So NiFi lets you dig an irrigation ditch off of a river. You see data flowing in real time, and you can interact with that and expand the ditch, add another one in a different direction, plug one up. You see what's happening with real data as it's flowing. All that monitoring and feedback comes from the original demand for NiFi. NiFi was initially conceived at the US National Security Agency by Joe Witt, because he was encountering a lot of these brittle data movement systems across the organization. Kevin mentioned that in his experience as well, all these little scripts and things like that. The data is mission critical, but with all these moving parts, a lot of it wasn't repeatable or resilient in the face of disruption.

So if there's a connection that goes down at 3 in the morning, you probably don't have a software engineer on standby to write code to change the destination of a stream, or a Perl expert who can look at a hundred parentheses and semicolons and figure out what needs to change in that second. In these giant operations centers, if you remember the show 24, with all the screens on the wall and people on these tiered platforms throwing things onto different screens, you need to keep the data moving at that time. So these people, who were not software developers or coders but operations people, needed a tool where they could take immediate action and keep the data flowing, and be able to see problems as they happened or, ideally, even before they happened. NiFi matured for a number of years inside the organization, and it spread to a bunch of different teams. In 2014, the agency released NiFi to the Apache Software Foundation through its technology transfer program. It has continued to grow with a number of contributors in the open source world, probably more than it ever would have been able to inside the organization.

And you mentioned that the primary purpose of NiFi is being able to, as you put it, build these irrigation ditches and then redistribute the water flow as it works its way through the system, and being able to evaluate in real time and very reactively how the data is flowing and modify it while it's in flight. Is the intent, in the general case, to eventually codify some of those data flows with more robust systems, the sewer pipes in your analogy?

So that's certainly an option. What we have found through a number of different users and customers is that some organizations like that reactive and ever-evolving model, being able to receive feedback immediately and have this very tight life cycle for developing their flows. Other organizations very much prefer the more traditional approach of building something, deploying it, and having it stay constant until an intentional change is made. I'm not gonna say it's a one-to-one match, but it is similar to agile versus waterfall in the software development life cycle. A lot of organizations that are deploying NiFi into production environments have rules, whether externally imposed regulations or their own internal rules and systems, about how things in production can change. So they have these approval processes: you need to get sign-off from a certain team, change can only happen at a certain time, and things like that. And they have very restrictive access control for the production systems, so that a user isn't changing a flow while it's deployed live in production. We've seen the demand for that grow quite a bit now in the open source world, with NiFi being deployed to private organizations. So that, I think, is more along the lines of what you're asking about, codifying these into more rigid, perpetual data flows.

The one thing I would add to that is that NiFi is really good in an environment where you have changing requirements. When you do have a data flow that is codified in some other form, it's great until the data or the requirements change. So where NiFi's strength at the beginning is getting started and building a data flow very quickly and visually, it's also its strength when it comes time to modify a data flow to reflect that the data or the requirements have changed.

That's a great point, Kev.

Yeah. That's definitely one of the areas that comes up a lot in conversations that I've been having: as you build these very robust and structured systems that are intended to be industrial strength and able to process massive amounts of data, the data, as everyone knows, is very messy and potentially changing. If you don't control the source of the data, then you can be surprised by having some format of it modified, or a field added or removed, that all of a sudden breaks all of your assumptions, thereby requiring a lot of upfront investment in terms of alerting and monitoring and testing of the data as it hits these various transition points through these different systems.

Yeah, exactly. And that's kind of the motivation for why NiFi went with a web UI for actually defining and authoring the data flow. As Andy mentioned, one reason was that it was designed to be a system that could be used by non-programmers, by operations folks who might be very good at managing data and understanding it but just don't have a background in traditional software programming. But making it a tool for designing and implementing a data flow visually and graphically is also its strength when it comes to going back to a flow that's been running for 6 months and making a change to it.

For one, it's very quick to remember what it was doing in the first place, even if you haven't seen it for a while, because it's visual and you have those visual cues and descriptions as to what's going on. And then also, when you're making a change, it's through a UI. It's easier to understand and reason about. You've got arrows and connections showing how one thing goes to another thing. You don't have to trace code through an IDE. So it's designed to solve that problem in an environment where you need to be reactive or agile, and you expect your flow to change over time. The interface to NiFi reflects that design goal.

And how did you each get involved in the project?

Yeah. You wanna go first?

Sure. Andy kinda roped me into it.

No. I mean, that's true and not true. I started hearing more and more about this project called Apache NiFi, probably around the 2016 timeframe.

It just started popping up everywhere in my world. Where I was working at the time, we were investigating solutions to build out our next-generation data flow system. It was gonna connect internal systems and bring data in from external third parties, and the way we were doing this wasn't gonna scale. It wasn't very maintainable, so we were looking for a better solution to design our replacement on top of. NiFi came up then. It looked really powerful.

Around the same time, I started hearing from other developers in my personal network, from college and various jobs. They'd all ended up, through multiple hops at different companies, working on this Apache NiFi project, coincidentally.

One of these was Andy, and these were people who I knew really well. I knew they were good at what they do. They certainly would not be hurting for opportunities in this industry. So the fact that they had all chosen to come work on this project really said a lot to me. When the time came that I was looking for something new to do, I reached out to them and asked, what's going on with this Apache NiFi project? Do you have need for more help on it? I figured out how I could get involved, and it was a great choice. I love working on it.

Andy, how did you get involved in the project?

So I was working on different problems, but I was sitting next to, and good friends with, a few of the early developers on NiFi. So it was something I was tangentially aware of. It really wasn't until my time out at Apple, where I was dealing with large dataflow problems and didn't have a tool like this, and we were building a lot of disparate systems internally, that I realized how important the visibility, the responsiveness, the resiliency, and the detection aspects of dataflow problems and their solutions are, because other organizations just don't have access to this, or at the time didn't, and used to spend so much time and energy and resources trying to build their own systems. So that was how I got interested in the project. Then when the project was open sourced, I had an opportunity to join a startup that was working on it and had all of the original people who built the thing from scratch. I turned it down because I wanted to stay in California instead of moving back to Maryland. Then 6 months later I had an opportunity to join again because they had been acquired by Hortonworks, and Hortonworks would allow me to stay in California.

So I jumped at that, and that was almost 3 years ago.

And we've talked a little bit about some of the workflow and use cases for NiFi. But from your perspective, where do you see it sitting in the broader landscape of data tools? On one end of the extreme, you have some of these industrial-strength systems that we talked about briefly, such as Spark or Flink or Kafka, where you're processing massive amounts of data at high velocity. And at the other end of the extreme, you've got individual scripts or orchestration tools such as Luigi or Oozie or Airflow for being able to trigger various jobs in various external systems.

Yeah, that's a great question. Go ahead, Kev.

I definitely think, so, one, NiFi is very mature.

It was worked on and developed and hardened internally at the NSA for a long time before it had its open source debut. So it's definitely battle tested and pretty mature, and that also comes with a ton of integration with these other systems, which makes NiFi a really great complementary tool to a lot of the other systems you mentioned, the other data management, analytics, and storage technologies that are out there. Because if it has an API to write data to or read data from, or even if it doesn't, chances are NiFi can connect to it pretty easily using tools that are built into NiFi out of the box called processors.

A lot of processors are adapters to other systems.

And if one of the few hundred out-of-the-box or community-available processors doesn't do exactly what you need, NiFi is really extensible.

So as Andy mentioned, NiFi is all about data in motion, data flowing. If you have a data platform that has one of these other systems in it, chances are you need to feed it with data or take its output and store that somewhere else. And NiFi is just a great tool to do that job for you.

And so in some ways, it seems like it is also sitting in a similar area to tools such as Fluentd, which are intended to be the integration points between multiple systems, able to route data from one location to another.

I think how you view NiFi really depends on your experience and familiarity with other tools.

There are a lot of people who have used ETL systems, and that's how they see NiFi. It's not really what it is, but that's the context that they have, so that's how they approach it. There are other people who have job or workflow orchestrator experience.

Some people have used an enterprise service bus.

We feel pretty strongly that NiFi is for moving data.

It's not necessarily the backbone of your system. It's not a big storage lake for your analysis.

It's not a complex event processor.

There are a lot of things that it's not, but what it really is is a platform that should let you get the data from where it's generated, as early as possible, and put it into the locations where consumers, your other tools or other users, need it to be, and in the format that it needs to be in. So it does a lot of transformation, enrichment, filtering, prioritization, and routing.

Like Kevin said, it's very extensible.

We want it to be able to work with whatever offering you have, so that you don't have to do extra work on the source systems or the follow-on systems to get data into a different format when NiFi can do that for you. And then there's also a subproject, Apache NiFi MiNiFi, which is designed to extend your reach farther out into this giant data ecosystem, so you can collect and manage data earlier in the data life cycle.

NiFi is designed to take advantage of all of the resources that are available to it. You put it onto a server or a cluster, and it's going to use all of that power. Whereas MiNiFi is designed to be a good neighbor.

It sits on some other system that's not intended to be running it, and it respects the other systems or software that's there. So it's really good for embedded devices, IoT devices, things like that. It is available in both Java and C++. It's about a hundred, well, actually a thousand times smaller than NiFi for the C++ version.

It doesn't have a user interface, but it still lets you do a lot of the small data operations and extends your visibility, your insight, and your decision-making capabilities much farther into that weird ecosystem of lots of different data generators.

And so it sounds like maybe the best way to think about it is as a data integration framework with a high-level interface to allow for easy self-service by non-programmers.

Some people see it as a competitor to Kafka, and we really don't, because I can't tell you how many times we do a deployment and it integrates directly with Kafka because that's the message bus that the user has chosen.

So it works really well in conjunction. But with Kafka, for example, you have to have these connectors, and you often have to write code to integrate it with your other system or run a command-line tool that's specifically built for another system.

Whereas with NiFi, ideally, you can just drag a box onto the canvas and it will work with whatever your other source or follow-on system is.

So we really try to make it as drag-and-drop easy as possible. And that UI, like you mentioned, is really helpful for that.

It's just to try and, like I said, lower the barrier to entry for getting all that data connected and flowing.

And the data that you're working with when you're building a NiFi workflow, is that primarily going to be routed through the same servers that NiFi is running on? Or is it more of a system that is going to trigger operations in other external systems? Or is it a combination of both, dependent on whatever it is that you're trying to do?

Yeah, that's a great question.

Almost all of the data will flow through the NiFi systems.

NiFi has the capability, through either execute script processors, shell commands, or invoking HTTP endpoints, to trigger operations on other systems. But that kinda goes back to the job or workflow orchestrator role. You can kind of force NiFi into that role, but there are other tools that are better for that. NiFi is really designed more to be the steward of the data as it passes through the system and move it from A to B, as opposed to telling some external system to move it from A to B. Because that way you get the data provenance feature, you get the statistical analysis features, you get the auditing, you get reporting. All of those benefits come from NiFi being the one that moves the data itself.

Yeah. I will just add to that. The way NiFi reads and writes the data is through an abstraction.

NiFi has various repositories.

The main bulk of data is in the flow file content repository, so that's the actual bits of the data flowing through the system.

And that repository is an abstraction, an interface.

So even though data is on NiFi, on the same system that's running the NiFi server, that doesn't have to be a physical file system. It often is, but if you're doing a containerized or cloud-based deployment and you're in an ecosystem where your cloud vendor can present network storage as local storage, something like Kubernetes running in one of the major cloud vendors, that's a popular approach. Because we have these abstractions for the various repositories, it's pretty extensible in terms of what you want to use as the storage in your architecture when you're building out your infrastructure. So there are lots of ways you can configure something like a cloud-based deployment for how you're gonna set up your data storage along with your data processing and make that all work together.

And it will work with NiFi if you set it up correctly.

Yeah. I would also say, I know we're throwing a lot of terms around, and this sounds very complicated when it's nebulous and ethereal in somebody's head.

Go to nifi.apache.org and download it and start playing around with it, and a lot of this will fall into place very quickly.

And on the point of the data provenance and tracking and monitoring features, those are definitely very valuable aspects of the system, and things that are often secondary concerns of other tools or platforms and that can be difficult to get right, particularly in terms of things like metadata tracking. So can you talk a bit about how that all integrates in the NiFi system, and some of the reporting and alerting capabilities that are built on top of those?

Yeah. So Kevin mentioned the repositories that NiFi uses to store data. And like you said, the content repository stores the actual arbitrary bytes that you're operating on. It could be CSV or JSON text. It could be video or images or binary data. Whatever it is, it's put into that repository, and there's basically a content claim, which is, think of it like a C pointer. It directs you to an address within that repository, and you just move that little pointer around throughout the system. You don't have to copy the data over and over when it's unnecessary.
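
As a rough illustration of the content claim idea described above, here is a minimal Python sketch. It is not NiFi's implementation: the class names ContentRepository and ContentClaim, the append-only byte store, and the two queues are all assumptions made up for this example. It only shows the pattern of writing the bytes once and then passing a small pointer around instead of copying the payload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentClaim:
    """Hypothetical pointer into the repository: where the bytes start and how long they are."""
    offset: int
    length: int


class ContentRepository:
    """Toy append-only byte store, for illustration only."""

    def __init__(self) -> None:
        self._store = bytearray()

    def write(self, data: bytes) -> ContentClaim:
        # Append the bytes once and hand back a small claim that points at them.
        claim = ContentClaim(offset=len(self._store), length=len(data))
        self._store.extend(data)
        return claim

    def read(self, claim: ContentClaim) -> bytes:
        # Resolve the claim back to the bytes it points at.
        return bytes(self._store[claim.offset:claim.offset + claim.length])


repo = ContentRepository()
csv_claim = repo.write(b"id,total\n1,9.99\n")
json_claim = repo.write(b'{"id": 1}')

# Two hypothetical queues in a flow hand the claim along; the payload is never duplicated.
incoming_queue = [csv_claim]
outgoing_queue = [incoming_queue.pop()]

payload = repo.read(outgoing_queue[0])
```

Moving the claim between queues costs the same whether the content is a few bytes of CSV or a large video file, which is the point of the pointer analogy.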

And then there's a flow file repository. A flow file is that unit of information. It has a series of key-value pairs, which we call attributes, and those are small and lightweight and kept in memory so that they're very fast and can be operated on at any time throughout the life cycle of that flow file. And then it has the pointer to the data. And then there's a third repository called the provenance repository. Data provenance, the term comes from wine and art history. It's being able to see the entire history of that piece of art or bottle of wine. But in this case, it's the data itself. So any operation that happens to that data, from the flow file getting created at the beginning of the system, to it being routed based on some attribute and decision making there, to transformation, encryption, new attributes being added or attributes being modified, being sent to an external system, being viewed by a user in the user interface: every event that happens creates another record for the provenance of that flow file. And so you can trace that. There's a very nice user interface for tracking the lineage of data, but you can also export that information into whatever format you want, and that gets stored in that provenance repository. You can also then treat that provenance data as raw data. You could do things like send it into another NiFi flow and treat it just like you would treat any other data ingested from some external system. So you can export it to Apache Atlas for governance and tagging, or you could do transformations and visualize it in something like Kibana. I had an intern a couple of years ago who used the provenance data in a machine learning model with a k-nearest neighbors algorithm, and analyzed not only the efficiency of various flows but also could do anomaly detection and see things like early failure detection for hardware if the latency started changing on certain parts of the flow. There are all kinds of applications for that.

As far as the reporting, NiFi comes out of the box with some reporting features. You can obviously send to email or Slack or other reporting tools like that. But you can also have memory consumption being reported on, CPU usage, all kinds of reporting things that hopefully make a DevOps person's life easier and integrate with whatever tool they're using to monitor their deployments as well.
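
To make the flow file and provenance description above a little more concrete, here is a small Python sketch under stated assumptions. The FlowFile and ProvenanceLog classes, the event labels, and the "billing" destination are hypothetical and chosen for illustration; this is not NiFi's provenance repository or its API. It shows the general idea that every operation on a flow file appends a record, so the lineage can be replayed later.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count(1)


@dataclass
class FlowFile:
    """Toy flow file: lightweight attributes plus an opaque pointer to the content."""
    attributes: dict
    content_claim: str
    id: int = field(default_factory=lambda: next(_next_id))


class ProvenanceLog:
    """Hypothetical append-only event log, for illustration only."""

    def __init__(self) -> None:
        self.events = []

    def record(self, flow_file: FlowFile, event_type: str, details: str = "") -> None:
        # Snapshot the attributes so later changes do not rewrite history.
        self.events.append({
            "flow_file_id": flow_file.id,
            "type": event_type,
            "details": details,
            "attributes": dict(flow_file.attributes),
        })

    def lineage(self, flow_file_id: int) -> list:
        # The ordered event types recorded for a single flow file.
        return [e["type"] for e in self.events if e["flow_file_id"] == flow_file_id]


log = ProvenanceLog()
ff = FlowFile(attributes={"filename": "orders.csv"}, content_claim="claim-1")
log.record(ff, "CREATE")

ff.attributes["route"] = "billing"
log.record(ff, "ATTRIBUTES_MODIFIED", "added route attribute")
log.record(ff, "ROUTE", "route == billing")
log.record(ff, "SEND", "delivered to a hypothetical billing system")

history = log.lineage(ff.id)
```

Because each record is plain data, a log like this could itself be exported and processed like any other dataset, which mirrors the point about treating provenance as raw data.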

And further on the point of the graphical UI being the primary means of interacting with the system, it can often be difficult to design that in a way that's flexible enough to actually be useful in a broad context, because it's very easy to get stuck in the mindset of, I just want something that's very easy for somebody to get started with. Then once they get to a certain level of capability, they start butting up against the limitations imposed by the UI.

Or you can go to the other extreme, where you have a thousand knobs and dials, and nobody ever understands how to actually get that thing to work. So it ends up being worse than not having a UI at all. So what are some of the approaches that you use to balance between those extremes?

Yeah. Great question.

Especially because I think when anyone is new to NiFi, the background that you have and where your strengths are are gonna color how you view it. Some people view the UI and the visual flow authorship, and that's great for them.

They program when they have to, but they'd like to avoid it otherwise. Other people are very comfortable programming in many languages, have a long history of scripting things out, working from a command line in a console, in completely text-based systems, and they're just more comfortable there. For them, the UI can be a little bit unnecessary and in the way of how they normally work. Or even if it's not, and once they start using it they like it and it's great for getting started, it's not clear to them how this tool is going to work in their traditional workflow, which is code based or text based. It's a really exciting time right now for NiFi because I feel like it can actually span these ways of interacting with it really well. It's great when you're new to it because it's a visual tool. It has a UI. Documentation's right there.

So while it does have a little bit of a learning curve, you can jump right in and start building flows that actually work, that actually do stuff. They accomplish the goal you set out to achieve without you stumbling over things like syntax and a new DSL or language that you're unfamiliar with. So it's great to get started with. And then when you are ready to do something like codify your data flow, move it from maybe your local laptop that you're developing and testing on onto a production server, the tools in the ecosystem are there to help you do that very easily. One recent subproject of NiFi, called NiFi Registry, is designed as an external and centralized storage location that handles flow definition storage, flow versioning, and flow diffing. It integrates directly with NiFi and other tools in the ecosystem.

So, you know, think of it as, you know, a code repository

or an artifact repository

like Nexus in Java. It's a place that when you're done with something, you can commit it or save it or mark it as a stable version of something you want to move somewhere else or have available as a copy later. And then over time as you build up those versions,

you know, you have these snapshots, so you can get back to any point in time. So if you make a change and

don't like it, you haven't lost the known good state you were at before. So NiFi combined with NiFi Registry is a really powerful combination. It makes NiFi really fun to use because it's so tightly integrated. So whenever I'm working on a flow, you can just right click, save a version,

change version, and then if you have 2 NiFi instances pointed to the same NiFi registry,

you can easily,

you know, write from 1 NiFi and read from another 1. So I can push up to a NiFi Registry from my local development machine, my laptop,

and then pull down that change on a running server in another environment, whether that's an integration test environment, staging, production,

what have you. So while NiFi is really approachable with a UI at first, once you wanna do

automated

tasks

or

something that's

reproducible and reliable

or more codified.

It's all there under the hood. So, you know, flows get saved in

a few different formats that are deterministic

and completely specify your flow. So, you know, there's an XML format that's used by NiFi. Registry stores in a JSON format. So there are ways to get your flow definition into something text based, and from there, there are lots of tools that you can use with NiFi and NiFi Registry.
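
To make the snapshot idea concrete, here is a minimal Python sketch of versioned flow storage. It is a toy in-memory model, not NiFi Registry's actual API or storage format: `ToyFlowRegistry`, its methods, the flow name, and the shape of the flow dictionaries are all invented for illustration.

```python
import json


class ToyFlowRegistry:
    """Hypothetical in-memory store of versioned flow snapshots (not the real Registry)."""

    def __init__(self):
        # flow name -> list of serialized snapshots, oldest first
        self._versions = {}

    def save_version(self, flow_name, flow_definition):
        """Store an immutable snapshot and return its 1-based version number."""
        # sort_keys makes the serialized form deterministic for the same content
        snapshot = json.dumps(flow_definition, sort_keys=True)
        versions = self._versions.setdefault(flow_name, [])
        versions.append(snapshot)
        return len(versions)

    def get_version(self, flow_name, version):
        """Return a copy of the flow as it was at the given version."""
        return json.loads(self._versions[flow_name][version - 1])

    def changed_keys(self, flow_name, old_version, new_version):
        """List the top-level keys whose values differ between two versions."""
        old = self.get_version(flow_name, old_version)
        new = self.get_version(flow_name, new_version)
        return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


registry = ToyFlowRegistry()
v1 = registry.save_version(
    "ingest-logs", {"processors": ["GetFile", "PutFile"], "schedule": "5 sec"}
)
v2 = registry.save_version(
    "ingest-logs", {"processors": ["GetFile", "PutFile"], "schedule": "1 sec"}
)

rollback = registry.get_version("ingest-logs", v1)  # the known good state
diff = registry.changed_keys("ingest-logs", v1, v2)  # what changed between versions
```

Because every save appends a new snapshot instead of overwriting the old one, going back to a known good state is just a lookup by version number.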

Both have, behind the UI, a very well documented and complete REST API that you can interact with, and there are lots of tools that interact with those APIs to make it even easier to

get started without calling the REST API directly. There are abstractions.

1 that's relatively new came out around the same time as NiFi Registry.

It's called NiFi CLI, and we can talk more about that. But it's a great way to start

automating some of your

operations work,

once you're using NiFi in a more traditional, like, enterprise or

industrial environment. As far as your question about balancing the UI between beginner and advanced or, you know, people who want to use the UI versus other ways to control NiFi,

I was gonna say we have a great designer, Rob Moran, who basically, anything you see in NiFi, he has done a ton of talking to users and making it easy.

And then,

Matt Gilman and Scott Aslan are 2 of the developers who can translate those ideas into

actual responsive user interfaces that people enjoy using, which is not a skill I ever picked up. So happy that they're on the team. Yeah. I always say that I can make things work. I just can't make them pretty. Yeah. 1 of the questions I was about to ask as well is how you would manage something like a code review process that you would use with a typical software development life cycle. But if you're able to

sort of export

the data flows in these textual formats, then you could still use a lot of your standard tooling, such as Git and GitHub or whatever it is you might be using to version and codify these

definitions

and still put them through a code review process before then promoting them up to a production system. So that definitely adds a lot of value as well. Yeah. So 1 of the tools I mentioned, NiFi Registry, it actually does have,

like NiFi, it's very extensible, so you can

configure various providers for certain internal interfaces.

And for its persistence, you actually can configure it to use a git repository

with automatic syncing to a remote. So I can run a NiFi Registry instance, have it save every version of every flow as a commit to a git repository that automatically gets pushed to a GitHub repository

or something like that. So there are definitely ways to work with these tools that are available. What we're really trying to do is not dictate 1 workflow, but rather give you a set of tools that are maturing all the time that you can adapt

to whatever workflow your team or

organization is using. I found from my experience, it's a little bit different everywhere you go. So rather than being very opinionated about that,

we're trying to give our users a set of tools that are flexible enough to integrate with however they want to use

flows

alongside their traditional

SDLC.

And

for people who are working on

integration points or adding functionality

to NiFi from external systems or building it into their existing platforms, what are some of those integration points that are available?

So in version 1.7.0, which was just released at the end of June, NiFi comes with over 260

processors out of the box, and I think it's 48 controller services.

So it's almost always the case that you're gonna have something

immediately that you can work with. But let's say you're working with something like a proprietary

format that your organization uses or some other custom thing that's not out there and NiFi doesn't have something for already. There are a few ways that you can integrate with that. NiFi has a very clearly defined framework and API,

and then all these extension points. So for example, writing a custom processor is a fairly easy task. There's a Maven archetype, which is available. So it gets you up and running, builds the skeleton of everything that you need and you just have to populate 1 method really.

Maybe some validation if you want to as well. But you can also

tighten that,

feedback cycle and use

what we call an ExecuteScript processor,

where you can just drop

a script in

Groovy, Ruby, Python, Lua,

Clojure,

or JavaScript

into this ExecuteScript processor and it will run and it has access to all the context that a normal processor would have. So you can make changes directly to the code very, very quickly. All you do is stop, edit the script, and then start the processor again. So that's on the order of magnitude of, like, seconds or minutes for a feedback loop, as opposed to writing a custom processor, writing unit tests for it,

building it, deploying it into a NiFi instance, restarting NiFi. Like, that'll get you onto the minutes

to an hour

order of magnitude. So we recommend using ExecuteScript to, like, prototype things and find the behavior that you want and then codify that into a custom processor that you can deploy

across all of your NiFi instances and have that, you know, functionality,

available for any user,

right away.
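
As a rough illustration of that prototype-in-a-script workflow, the Python sketch below is shaped like a small script body that takes one flow file, tags it, and passes it on. Inside NiFi the script would be handed a session object and a success relationship by the processor. Here both are replaced by hypothetical stand-ins (`StubSession`, `StubFlowFile`, and a plain string for `REL_SUCCESS`) so the logic can be run and tested outside NiFi. The attribute name `reviewed` is invented, and the stand-ins only assume the real bindings behave this way.

```python
class StubFlowFile:
    """Hypothetical stand-in for a flow file: just a bag of attributes."""

    def __init__(self, attributes=None):
        self.attributes = dict(attributes or {})


class StubSession:
    """Hypothetical stand-in for the session object a script is handed."""

    def __init__(self, queued):
        self.queued = list(queued)
        self.transferred = []

    def get(self):
        # Return the next queued flow file, or None when nothing is waiting
        return self.queued.pop(0) if self.queued else None

    def putAttribute(self, flow_file, key, value):
        flow_file.attributes[key] = value
        return flow_file

    def transfer(self, flow_file, relationship):
        self.transferred.append((flow_file, relationship))


REL_SUCCESS = "success"  # stand-in for the relationship constant


def tag_flow_file(session):
    """The part you would iterate on: take one flow file, tag it, route it."""
    flow_file = session.get()
    if flow_file is None:
        return  # nothing queued on this run
    flow_file = session.putAttribute(flow_file, "reviewed", "true")
    session.transfer(flow_file, REL_SUCCESS)


session = StubSession([StubFlowFile({"filename": "a.csv"})])
tag_flow_file(session)
```

Once the behavior looks right, the same logic is what would move into a custom processor.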

There's also, you know, you can use NiFi to run a shell command

and pipe input and output. So let's say you have this legacy system where somebody wrote this compiled

library object 20 years ago, and you have no visibility into what it's doing inside, but it's this critical piece of

infrastructure that everything has to go through.

You can have a flow set up that's just delegating that portion of the work to that shell command,

feeding the input and getting the output, and keeping that in a flow that's managed.
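
The pattern of delegating one step to an opaque command can be sketched in plain Python with the standard `subprocess` module. This is only an analogy for what the flow does, not NiFi code. The "legacy" command here is a hypothetical stand-in (a one-line Python program that upper-cases its input) so the example runs anywhere, and `run_legacy_step` is an invented name.

```python
import subprocess
import sys

# Hypothetical stand-in for the opaque legacy binary: reads stdin, writes stdout.
LEGACY_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())",
]


def run_legacy_step(payload, command=LEGACY_COMMAND, timeout_seconds=10):
    """Pipe payload bytes into the command and return whatever it writes to stdout."""
    completed = subprocess.run(
        command,
        input=payload,            # fed to the command's standard input
        stdout=subprocess.PIPE,   # captured as the step's output
        stderr=subprocess.PIPE,   # kept separate so errors do not pollute the data
        timeout=timeout_seconds,  # do not let a hung legacy process stall the flow
        check=True,               # a non-zero exit code raises instead of passing silently
    )
    return completed.stdout


result = run_legacy_step(b"order-123,pending\n")
```

The caller never needs to know what happens inside the command. It only relies on the contract of bytes in, bytes out, and an exit code.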

And you can have that running while in the meantime, you might have a team trying to modernize that

side by side. So there are really a lot of ways that you can

again, our goal is just to make NiFi

easy to use the way that your system currently works and not have you have to rearchitect your entire organization

in order to adopt this tool, which isn't

adding much if you have all that added cost. And what are some of

the use cases

or

project requirements that lend themselves well to being solved by NiFi?

As Andy mentioned, to, you know,

introduce NiFi,

it's kinda designed to solve the generic data flow problem. I have data in motion. I wanna manage

how that data flows, how it moves through my system. So it is really generic in that way, and, you know, what I would say is, even if you have a use case that might

on paper or at first look like it has very simple requirements

and

running something,

you know, as a server

may feel overkill.

You know, you just have to do a simple load of a file into a database. You can write it in any scripting language you're comfortable in. It's probably already on the Linux server. I would say chances are most projects' requirements do grow and change over time. So, you know, reaching for a tool like NiFi, you know

it's probably gonna do whatever you need it to do. If it doesn't, it's extensible.

And, you know, as your requirements change and grow over time,

it's going to grow and scale with you so that when you have a very large complex data flow 6 to 18 months from now, the tool you started with is still very well suited for the job.

That's all I have. I don't know if you wanna add to that, Andy. Yeah. I think that's a great answer as far as the actual, like,

data

requirements.

I would say as far as, like, industries or or use cases,

we know that,

a lot of industries

really are relying on, you know, far more data than ever before to do just their fundamental, their core

functions. And so you have financial institutions that need to track transactions,

and they wanna add value and protect themselves by doing, like, fraud detection and pattern analysis.

We have telecommunications

customers. We have hospitals and health care providers

who need to track all of their patient data really carefully.

They need to integrate with legacy tracking systems and hardware devices.

Oil and gas is a huge 1,

and it's ironically very similar to the original use case of NiFi where you have all these distributed

systems that are collecting massive amounts of, like, sensor data. And on a rig, that can be life or death. I mean, everybody is familiar with, like, Deepwater Horizon and, you know, these cases where

there's sensors and being able to

have access to that data

earlier on can literally save lives.

And so with the oil rigs, you know, they throw a computer out there and put NiFi on it. And

Chris Herrera and a couple other people have, like, a case study where it saved them over 200% on their data transmission.

They were able to prioritize

and detect,

these kind of, you know, machine,

sensor

data patterns

much earlier, and it was, it was great for them there.

And then things that are IoT.

NiFi and MiNiFi in conjunction,

I think can absolutely

and I don't use this word frequently, but revolutionize

IoT because

you just have access to so much information.

You can perform

the necessary operations out at the edge. So there was 1 example with a connected car,

and we put a chip into the car that had MiNiFi running and could do all of this data exfiltration

and analysis

and protection and prioritization,

literally on the car before it ever transmitted anything back. And it saved massive amounts of money because it could prioritize

whether or not to use public Wi-Fi versus LTE. And LTE is, you know, obviously

available much more frequently, but much more expensive to transmit data over. So just so many

circumstances

where

having insight into your data

and being able to manage it efficiently,

opens up brand new possibilities that we may not even have considered yet. And that's why we wanna make this available to people who are the domain experts, and they're the ones coming up with these really cool ideas and features.
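
The connected car example comes down to a routing decision made per reading, on the device, before anything is transmitted. A minimal Python sketch of that decision follows, assuming a deliberately simple policy that is not from the episode: anything can go over Wi-Fi when it is available, only high priority data is worth paying for LTE, and the rest waits. The function name, priority labels, and sample readings are all invented, and nothing here is MiNiFi configuration.

```python
def choose_link(priority, wifi_available):
    """Return 'wifi', 'lte', or 'hold' for one reading (hypothetical policy)."""
    if wifi_available:
        return "wifi"  # cheap link: send everything
    if priority == "high":
        return "lte"   # expensive link: only for data that cannot wait
    return "hold"      # keep it queued on the device until Wi-Fi shows up


# Invented sample readings: (name, priority)
readings = [
    ("engine_temp_alert", "high"),
    ("cabin_temp", "low"),
    ("gps_trace", "low"),
]

plan_without_wifi = {name: choose_link(priority, False) for name, priority in readings}
plan_with_wifi = {name: choose_link(priority, True) for name, priority in readings}
```

The savings described in the interview come from the `hold` branch: low priority data simply never touches the expensive link.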

And this is just providing a platform to make it easier to do that. And what are some of the signals

that would suggest that NiFi is the wrong choice for a given project?

Yeah. Sure. I mean, we definitely believe that in data flow, it's

very unlikely in a lot of situations that 1 tool is gonna

solve everything for you. Chances are you're using

a collection of different tools that work well together. So NiFi is great to work with those other tools,

and there are definitely things that, if you find yourself using NiFi for them, you probably

wanna reconsider

or avoid it in the first place. So that would include

certain types of analytics. You can you know, I've seen plenty of data flows where people are doing some

light analytic work just right in line with NiFi. But if you wanna take, you know, advantage of some more advanced analytic features with things like sliding data windows, where

within that window I wanna find

a max or minimum or an average,

you know, like a rolling window over time. Like, that's not built into NiFi.

I wouldn't recommend using it for that. There are tools out there that do things like that, like Spark Streaming.
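
For readers unfamiliar with the term, this is the kind of stateful computation meant by a sliding window. The sketch is plain Python over an in-memory list, not Spark Streaming and not NiFi. The function name and sample values are invented, and the window here is count based, not time based, to keep it short.

```python
from collections import deque


def rolling_stats(values, window_size):
    """Return (min, max, mean) over the last window_size values, after each new value."""
    window = deque(maxlen=window_size)  # old values fall off the left automatically
    results = []
    for value in values:
        window.append(value)
        results.append((min(window), max(window), sum(window) / len(window)))
    return results


stats = rolling_stats([3, 1, 4, 1, 5], window_size=3)
# After the fourth value the window holds [1, 4, 1], so stats[3] == (1, 4, 2.0)
```

The point of the contrast is the state: each result depends on several earlier records, which is what a streaming analytics engine is built to manage at scale.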

So and there's a lot of tools in this space that might, at a glance, look similar because

they may even have a similar UI where you're building

a directed graph of components on a canvas. You know, or even if it's not visual, it's kinda text based, has the same paradigm of, I've got these processors and connections between them. But really under the hood,

they have been designed to do things differently

in terms of the trade offs that they make and how they're optimized to work. So you really need to look at your domain, and you need to think about

what are the types of expensive operations I wanna do or what are some of the complex modeling or algorithm

analytical work I need to do? And,

you know, is the tool I'm reaching for designed well to handle that case, or will I be fighting the tool in order to make it do that over time?

So, yeah. You have to be careful.

1 of the great things about NiFi is you don't have to just be tied to it. It's happy

to send data out to other systems and read it back in that way. So it's a great complementary tool even if it's not your core

data analytic or data store technology.

And for somebody who's looking to

get started with NiFi and deploy it in a production context, what are some of the considerations

that they should have in mind as they're

deploying and scaling the installation

and some of the system or network parameters that should be considered in the process?

So I would say, do you want the easy answer or the long answer? Because the easy answer is just pull Apache NiFi latest from Docker Hub, and

you're good to go. And pull it a few more times if you need to scale. But the long answer is that there are a number of ways to deploy and to scale. And our container story is not as strong as we would even like it to be. So there's ongoing work for Docker and Kubernetes improvements

to really get into that

completely

dynamic scaling. As far as clustering,

we don't really recommend having more than 10 nodes in a cluster,

which can be

kind of different for people coming, especially, from like a Hadoop

experience where you want to scale your cluster to be super huge. But that's because NiFi is doing data movement, data flow. So there's a lot of network communication that happens.

And the processing itself,

we found that,

you know, a handful of nodes really is powerful enough

to do

the actual processing that most of these flows take into consideration,

and overloading it with too many nodes actually just increases the overhead.

So, yeah, if you have a cluster that's, you know, 4 or 5 machines

and they all have a few cores, some RAM, and a good 10 gig,

network card,

you're pretty much set up. You're at the level of

a giant enterprise,

deployment of NiFi.

And what have you found to be some of the most interesting or unexpected or challenging aspects

of building and maintaining the NiFi project and the associated community?

So I found in the year I've been working on the project, first of all,

the community is great. Probably the best software development community

I've ever been a part of or been involved with.

Really helpful. Lots of ways to ask questions and get questions answered. Everybody has a great attitude and is really helpful and welcoming. 1 of the things that has surprised me has just been,

from hearing people ask questions and describe their use case, like the complete

breadth of

various different uses people are employing NiFi for.

It's

across industries,

across

use cases, everything from,

you know, individual hobbyists

and researchers all the way up to huge enterprises,

you know, just massive companies with massive data problems,

tiny little flows with a few processors all the way up to thousands of processors on a canvas.

So, you know, that's a really challenging,

design problem when you're working on a tool and you're thinking about trade offs,

you have to consider all those different use cases.

And, you know, a lot of times it comes with, you know, just allowing things to be configurable

and flexible

and choosing reasonable defaults.

Yeah. I think for me,

there have definitely been a number of challenges.

1 was just adapting to open source, coming from a career where everything was very, very closed source. It was quite the opposite.

But

it's like Kevin mentioned,

there's so many different

people and organizations using this that, you know, on 1 day, I have a meeting with an organization that is processing, like, a billion

records a day using

the equivalent of 3 laptops

in a cluster.

And then the next day, I see something on Twitter where

the Cincinnati Reds are using it for some of their stuff. And so it's like,

it's

so across the board.

You get, you know, somebody I follow on Twitter for their information security expertise

just randomly said, you know, oh, I had this system and it was down. I was doing log ingest, and NiFi

was just magic. Like, it brought, you know, everything was queued up. And when the system came back online, I didn't lose any data. Like, I can't imagine doing this without NiFi in the future. And they had no idea that, like, I was, you know, I was working on it and I was following them. And it was just really cool to see that it's actually helping people do really cool stuff

across the board.

Yeah. And that brings up 1 of the other things that we didn't get into yet, which

is the idea

of

recovery

and resilience and maybe back pressure

when you do have some of these failures or mutations or changes

either in source or destination,

data stores. So I don't know if you wanna just briefly discuss some of the capabilities

and built in

techniques that you've used for addressing some of those situations?

Yeah. I can touch on it briefly.

So 1 of the ones you mentioned, back pressure,

this was 1 of the moments for me when I first started working with NiFi when I realized, like, wow. This is a really

full featured, well thought out system that's really,

you know, it really understands the domain of this problem. So

between processors, you set up connections.

On a given connection, you can set a back pressure attribute, which is basically,

you know, how many things can be queued up to go into this input before it's considered full. And that's based on, you know, what you've tuned to be the processing capabilities of this step in your data flow.
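
The threshold idea can be sketched in a few lines of Python. This is a toy model, not NiFi's implementation: `Connection`, `object_threshold`, and `run_upstream` are invented names, and the only behavior modeled is that the upstream side stops handing over work once the queue between two steps is considered full.

```python
from collections import deque


class Connection:
    """Toy queue between two steps with a count-based back pressure threshold."""

    def __init__(self, object_threshold):
        self.object_threshold = object_threshold
        self.queue = deque()

    def is_full(self):
        # "Full" means the configured threshold is reached, not that memory ran out
        return len(self.queue) >= self.object_threshold


def run_upstream(pending, connection):
    """Move items downstream until back pressure applies. Returns how many moved."""
    moved = 0
    while pending and not connection.is_full():
        connection.queue.append(pending.popleft())
        moved += 1
    return moved


pending = deque(range(5))                 # five items waiting upstream
connection = Connection(object_threshold=3)

first_run = run_upstream(pending, connection)   # stops once the queue holds 3 items
connection.queue.popleft()                      # downstream finishes one item
second_run = run_upstream(pending, connection)  # room for exactly one more
```

Nothing is dropped in this model. Work that cannot move yet stays upstream, which is the part that lets the pressure propagate further back.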

And then that can propagate back

throughout the upstream systems. And you can set various,

you know, essentially, configurations

for what to do when

they start getting back pressure from the components they're feeding. So it can propagate back up to your queue, and, ultimately,

you do have to make a choice what to do with your data, but NiFi gives you the ability to configure that and put the logic into your data flow such that, like, when a queue is full, what do you wanna do with it? Do you wanna set a time penalty before sending it more data, or

things such as that? So,

you know, and and these are things that you don't have to,

necessarily

always worry about right away. As I mentioned before, like, wherever possible,

we try to configure NiFi out of the box so that it has reasonable defaults. So if you don't wanna change these settings, you just leave them. But then the nice thing is, you know, after your flow has been running for a while or maybe your data size grows or you make a change to it, like, those settings are there when you need them. So it's a tool that

you don't need to fully understand to get started with. But then as you start digging into it, you realize, like, wow, I can just configure a few things and NiFi works the way I expect it to. Something I approach and think might be really complicated to solve

actually might just be a couple settings on a couple components,

and then the rest of the system responds

to do what you want it to do. And what are some of the

new features or improvements

or,

sort of broad ideas that you have planned for the future of NiFi?

That could be a whole new episode of the podcast.

So our Wiki has,

a feature roadmap and that's kinda

directed by the community.

You know, it's truly open source software.

Anybody,

from, you know, Joe Witt, who

created this and has been working on it every day for 15 years

to somebody who

finds the the Wiki page tomorrow and,

you know, wants to start contributing.

You know, it's very much a meritocracy of ideas type approach, and

we really do try to evaluate,

you know, the quality of the idea itself and not where it's coming from. So,

there's all kinds of things that are gonna be coming up.

For myself, working on the security side of things,

I have this idea for, like, a TLS auto secure feature where you just kinda set it and forget it, and NiFi will continually

upgrade the cipher suite selection and protocol version and all this kind of stuff based on Mozilla TLS Observatory data so that it can

keep you secure without you having to think about it on a daily basis.

There's encrypted implementations

of the various repositories.

There's

new work on encrypting attributes and recognizing some PII or PCI sensitive data and protecting that automatically.

Kevin can talk about the new stuff that's coming with

enhancements to NiFi Registry and

MiNiFi. So

Yeah. Like Andy said, there's

so much that's possible in the future of NiFi. There's a huge amount of ideas

and,

Jiras with feature proposals, things in the backlog and on the Wiki.

It's 1 of the things that makes working on the project really exciting is just

there's so much that

is still left to do and can be done on it to make it even better.

On the NiFi registry side,

it's still a relatively new subproject, so there's a lot that can be done there.

1 of the

ideas that we have right now is,

2 things there. So 1,

making it so that a NiFi Registry instance can be publicly hosted and shareable. So, you know, an analogy might be

Maven Central or Nexus repository

or Docker Hub for Docker images,

but a place that people can collaborate and,

you know, push up data flows or parts of data flows, you know, little modules of reusable sections.

So in NiFi, the term would be a process group where you can

group a subset of your flow on a canvas inside an element,

and then that now becomes something that you can save as an individual component up to a NiFi registry. So,

you know, enabling more features with NiFi registry

to support this type of collaboration and sharing, you know, it could just be something you run on prem and just share amongst your team or could even be

publicly hosted out on the Internet where,

people can publish

little reusable data flows that other people can pull down and do, you know, that

composite together

a lot of steps

that the simple processors do into something larger and more useful.

Also on NiFi registry,

you know, just being able to store things other than flows. So we touched on earlier in the podcast

the ability to write custom processors and,

you know,

you can start a Java project to do that. The output of that is a NiFi

archive format called a NAR.

So being able to

version those artifacts in the same place you're versioning your data flows in a NiFi registry

and

then

with NiFi, the ability to pull them in and load them at runtime,

rather than installing them into NiFi's lib directory.

So just, you know, think about just making it more flexible. The analogy here would be like a package manager for other systems, but a central place that you can publish,

binary components that then a NiFi or Minify instance

running anywhere that's connected and has visibility to that

can, you know, look them up by ID and version number and pull them down and and run them. So just making it more flexible

with how you deploy

these things, making parts of your NiFi data flow more modular and shareable.

So that's what's going on on the NiFi Registry front. On the MiNiFi front, there's a ton of work being done right now on the MiNiFi C++ agent, which is an alternative to the Java 1. It's designed to get you

all the way down into really tiny footprints of memory and processing.

Really modular. So, again,

you can pick and choose the parts of it that you want. It's extensible,

so you can run it

on a device, on an IoT device,

where you might have custom sensor

or USB

devices,

peripherals,

cameras,

sensors, what have you, and the ability to integrate directly into that with just a tiny little MiNiFi agent that's running natively on the device

and pulling all that data right on-site,

maybe doing some local processing with it there or just tagging it with NiFi provenance data and sending it back to

a central data center for the bulk of your processing,

which might be, you know, sending it off to another NiFi instance,

but getting the data provenance all right from the point in time it originated. So,

there's a ton going on with

MiNiFi,

NiFi, NiFi Registry.

Andy and I can't speak to this much, but

if you're into front end development,

1 of the things that started there is

NiFi Flow Design System, which is an Angular based

UI

UX design system

of reusable components

that got started as, really, you know, just like a library

for the front end work so that we can start to

put some of those UX UI components in 1 place, the theming,

the interaction

as these reusable pieces that we can then use to make our interfaces across various systems like NiFi and NiFi Registry more consistent. So if you're listening and you're a front end developer and that sounds like your thing, like, getting in on the ground floor of this awesome new

design system

that's gonna be used in all these different NiFi tools we've been talking about.

Check it out. It's a sub repo. It's on

GitHub or apache.org. Go to nifi.apache.org,

and you can read about how to get involved with all this stuff.

Yeah. The UI was actually something that when I left my previous employer to go work on this, 1 of the engineers looked at it. They came to me and said, wow. You know, I've been trying to do this,

this user interface type stuff in my side project, and I've never been able to build something like this. So I'm gonna steal all your code. It's like, well, it's open source. I mean, you're able to use it and,

hopefully, you know, power whatever cool features and ideas somebody else has too. So

So

for anybody who wants to get in touch or follow the work that you're up to, I'll have you each add your preferred contact information to the show notes.

And as the final question,

from your perspective, what do you view as being the biggest gap in the tooling or technology for data management today?

Yeah. 1 thing that's a big challenge today is, as I mentioned, chances are if you have

a real

significant data problem that you're trying to solve with a tool like NiFi

or the other tools that have been covered

on your podcast, Tobias,

chances are you might be using more than 1 of them, or you might be

using 1 and then migrating to a different 1.

And,

you know,

1 of the

rising challenges,

in this space is,

how do you deal with data governance and data lineage and

applying the right

policies, which might not be technical limitations, but might be

policy limitations or

regulatory

limitations or law limitations to what you're allowed to do with data.

Whether it can leave your country's

borders or if it needs to be anonymized before doing that. Right? It's a big topic right now with things like GDPR

and a lot of attention being put on data

utilization

in technology.

So while there are tools out there that are being developed

and admirably

trying to solve this problem, tools like Apache Atlas, which NiFi integrates with, is 1 that's going after this problem.

I think there's still

as the policies and regulations evolve, I think the tooling is gonna have

to catch up and stay current with

being able to easily adhere to those sorts of things.

So the biggest gap that I see

is, and this is 100% biased because this is what I deal with on a regular basis, security.

I still think that

the fundamental

approach to security is not as widespread and as,

strict as it needs to be. And security is often looked at as a bolt on or a follow on

responsibility,

after the core work of getting data from x to y or

doing some analysis.

And

it's always too expensive to deal with it upfront, but that's far, far cheaper than reacting to a breach or some other kind of incident. And as we see, I mean, you know, it's just ever increasing,

each week, something's happening.

So I think that, you know, NiFi is a good start in that, but certainly there's room for NiFi to improve as well. And just to,

make it easier for users to have a secure deployment because,

it's not fair to just keep saying, oh, the users didn't set this up or the users didn't take this into consideration

when the tools are not easily

configured or available or

understood. So I think that's a responsibility

on,

you know, us, the creators and developers of these tools, to make security

so easy to use that there's no excuse not to. So that would be my answer.

Yeah.

Great point. I totally agree. And I also think that's the reason that if you're trying to solve these problems, there's an advantage to

using,

you know, a community or industry accepted tool, something from open source

or a vendor tool where somebody is solving this problem for you. Because if you try to solve all these problems yourself and build a solution yourself completely in house, like, unless you're a huge company with a huge, you know,

engineering

team,

chances are

you're gonna miss something.

So you've got

real advantages

to trying to find a tool out there that you know is taking these considerations

really seriously. That's a good point, Andy.

Alright. Well, I want to thank the both of you very much for taking the time today to join me and discuss the work that you're doing on NiFi.

It's definitely a very interesting project and 1 that I'm excited to see future development on.

When I first came into this, I wasn't quite sure exactly what the platform was and what use cases it was targeting, so it's been very informative. So thank you both for that, and I hope you enjoy the rest of your evening.

Yeah. Thanks for having us on the show. Hey. Thanks, Tobias. Really appreciate

it.
