Data Pipelines with Apache Nifi
Quickly moving new data sources into a Big Data environment is one of the challenges we hear about most from our clients. It’s one thing to write some code that gets data from an API and moves it into the data environment; it’s another to make it an operational process that the end user (the business) can rely on. This article looks at considerations and lessons learned from using one of Twisted Pair Labs’ favorite heavy lifters in this problem space, Apache Nifi.
What is Apache Nifi?
Apache Nifi is an open source project originally developed by the NSA to manage very large streams of data. It was open sourced in 2014 as part of the NSA Technology Transfer Program.
Why Apache Nifi?
There are many complex technical challenges in managing the flow of data in and around a big data environment. This article highlights Nifi as a solution for three of them.
Low Code Development
Nifi provides a drag-and-drop interface that allows you to build data flows without writing code. Nifi is built around the concept of Processors, which are code modules that each handle one step in a flow. Processors can perform a number of operations:
- Fetch data from and send data to other systems using common protocols, technologies, and platforms such as HTTP, SQL, FTP, HDFS, Kafka, AWS SQS, and the Twitter API
- Transform, clean, and enrich data, including GeoIP lookups
- Securely listen for incoming data from a variety of technologies
Scalability and Data Process Operationalization
If you have ever written a data import or export process, you know that a major challenge is scalability. It’s easy to read a 10 MB file from system A and put it somewhere on system B. What about when you need to read terabytes of data and then process 100,000,000 rows in a way that does not tank the system? Nifi has two features that can help: clustering and back pressure. Nifi can be configured to spread your workloads across several nodes, and you can set back pressure limits between process steps so that an upstream step stops sending data once the queue feeding the next step reaches its limit, and resumes only when that queue drops back below it. This keeps you from overloading the server and makes your operational data processes more reliable.
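To make the back pressure idea concrete, here is a minimal Python sketch of two process steps joined by a bounded queue. It is a simplified model for illustration only, not Nifi's implementation: the function name `run_flow`, the batch sizes, and the strict item-count limit are all assumptions made for this example.

```python
from collections import deque


def run_flow(source_items, threshold, producer_batch=3, consumer_batch=1):
    """Simulate a fast upstream step feeding a slower downstream step.

    Returns the delivered items and the deepest the connecting queue ever got.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    if threshold < 1 or producer_batch < 1 or consumer_batch < 1:
        raise ValueError("threshold and batch sizes must be at least 1")

    pending = deque(source_items)  # data the upstream step still has to emit
    connection = deque()           # queue between the two steps
    delivered = []
    max_depth = 0

    while pending or connection:
        # Upstream step: skipped entirely while the queue sits at its limit.
        if pending and len(connection) < threshold:
            room = threshold - len(connection)
            for _ in range(min(producer_batch, room, len(pending))):
                connection.append(pending.popleft())
        max_depth = max(max_depth, len(connection))

        # Downstream step: drains the queue at its own, slower pace.
        for _ in range(min(consumer_batch, len(connection))):
            delivered.append(connection.popleft())

    return delivered, max_depth


if __name__ == "__main__":
    items, depth = run_flow(range(20), threshold=5)
    print(f"delivered {len(items)} items, queue never held more than {depth}")
```

The point of the sketch is that the upstream step is throttled by the queue depth instead of by the size of the input, so the amount of data in flight stays bounded no matter how much is waiting at the source.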
Nifi was built with the idea that developers would extend it with their own processors. The beauty of this approach is that a processor built within the Nifi structure inherits a great deal for free: it can be managed through the Nifi infrastructure and UI, and it benefits from Nifi’s scalability and observability.
An example of a custom Nifi Processor for a Big Data environment flow might be a processor that supports communication with your business’s legacy systems through a proprietary protocol. I worked with a large mutual fund processing system that had just such a protocol. The technology was referred to as a “View” and was basically a TCP port to which you sent a buffer of string data that had to adhere to a fixed format: the first 10 bytes were client name, the next 25 bytes were client ID, etc. Using Nifi, you could write a processor that knew how to speak that legacy language and incorporate data from that legacy system into a big data environment without a great deal of effort.
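The core of such a processor is the logic that packs and unpacks the fixed-format buffer. The sketch below shows that logic in plain Python for the two fields described above. It is not a Nifi processor and not the actual “View” format: the ASCII encoding, the space padding, and the function names are assumptions, and the fields after client ID are unknown and left out.

```python
import struct

# Layout taken from the description above: 10 bytes of client name,
# then 25 bytes of client ID. The remaining fields are not known here.
NAME_WIDTH = 10
ID_WIDTH = 25
RECORD = struct.Struct(f"{NAME_WIDTH}s{ID_WIDTH}s")


def _fixed(value, width):
    """Encode a string as ASCII and pad it with spaces (assumed) to width."""
    raw = value.encode("ascii")
    if len(raw) > width:
        raise ValueError(f"{value!r} does not fit in {width} bytes")
    return raw.ljust(width)


def encode_record(client_name, client_id):
    """Build the fixed-format buffer that would be sent to the TCP port."""
    return RECORD.pack(_fixed(client_name, NAME_WIDTH), _fixed(client_id, ID_WIDTH))


def decode_record(buffer):
    """Split a received buffer back into its named fields."""
    if len(buffer) < RECORD.size:
        raise ValueError(f"expected at least {RECORD.size} bytes, got {len(buffer)}")
    name, client_id = RECORD.unpack_from(buffer)
    return {
        "client_name": name.decode("ascii").rstrip(),
        "client_id": client_id.decode("ascii").rstrip(),
    }


if __name__ == "__main__":
    message = encode_record("ACME", "12345")
    print(len(message), decode_record(message))
```

In a real custom processor, this kind of parsing would sit inside the processor’s handling of each piece of incoming data, with Nifi taking care of scheduling, queuing, and monitoring around it.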
Free as in Apache “Free”
Apache Nifi is open source, and therefore “free” to use. It’s an Apache project, and as with many open source projects, what you save on licensing can sometimes be eaten up by other kinds of costs (challenges managing the server, usability issues, generally cryptic techie concepts, the learning curve). That said, Nifi is a great tool that can support a wide range of data pipeline use cases, assuming you are willing to put in the effort to learn it.