13 September 2017
While working with Kafka, I did a lot of research on event-driven messaging and event-based architecture. Along the way I stumbled upon Apache NiFi, which helps create complex data flows for distributed or Internet of Things (IoT) applications. I decided to write this introduction to NiFi, which I expect to be a key player in IoT applications in the future.
Apache NiFi is an open source tool for automating and managing the flow of data between systems (databases, sensors, Hadoop, data platforms and other sources). It solves the problem of collecting and transporting data in real time from a multitude of data sources, and it provides an interactive user interface for controlling live flows, with full, automated data provenance.
It is data source agnostic, supporting disparate and distributed sources of differing formats, schemas, protocols, speeds and sizes, such as machines, geolocation devices, click streams, files, social feeds, log files, videos and more. It is configurable plumbing for moving data around, similar to how FedEx, UPS or other courier and delivery services move parcels around. And just like those services, Apache NiFi lets you trace your data in real time, the way you would track a delivery.
The project is based on the flow-based programming model, is written in Java, and provides a web-based user interface to manage data flows in real time. NiFi provides the data acquisition, simple event processing, transport and delivery mechanism designed to accommodate the diverse data flows generated by a world of connected people, systems, and things. It was developed as a classified project at the United States National Security Agency (NSA) for 8 years under the name Niagarafiles. The NSA made it open source through the Apache Software Foundation in 2014 via its technology transfer program.
NiFi is used for data ingestion: it pulls data in from numerous different data sources and creates FlowFiles from it. It can process extremely large items, extremely large data sets, extremely small items arriving at high rates, and variable-sized data. It suits a wide range of use cases.
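To make the FlowFile idea concrete, here is a minimal sketch in plain Python. It is not NiFi's API: the `FlowFile` class, the `ingest` helper, and the `lineage` list are hypothetical names chosen for illustration. The only assumption it borrows from NiFi is that a FlowFile pairs content with key/value attributes and that events on it can be recorded for provenance.

```python
from dataclasses import dataclass, field
import time
import uuid


@dataclass
class FlowFile:
    """Toy stand-in for a FlowFile: a payload plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)
    # Hypothetical provenance trail: (timestamp, event description) pairs.
    lineage: list = field(default_factory=list)

    def record(self, event: str) -> None:
        """Append a timestamped event so the data can be traced later."""
        self.lineage.append((time.time(), event))


def ingest(raw: bytes, source: str) -> FlowFile:
    """Wrap raw bytes from any source in a FlowFile and note where they came from."""
    flow_file = FlowFile(
        content=raw,
        attributes={
            "uuid": str(uuid.uuid4()),
            "source": source,
            "size": str(len(raw)),
        },
    )
    flow_file.record(f"RECEIVE from {source}")
    return flow_file


if __name__ == "__main__":
    ff = ingest(b'{"temp": 21.5}', "sensor-42")
    ff.record("ROUTE to storage")
    for _, event in ff.lineage:
        print(event)
```

The point of the sketch is that the payload is opaque bytes, so the same wrapper works for a sensor reading, a log line or a video chunk, while the attributes and the event trail carry everything needed to route and trace it.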
Both Apache NiFi and Apache Kafka provide a broker to connect producers and consumers, but they do so in quite different ways, and the two are complementary when you look holistically at what it takes to connect the enterprise. With Kafka, the logic of the dataflow lives in the systems that produce data and the systems that consume it. NiFi decouples the producer and consumer further and allows as much of the dataflow logic as possible, or desired, to live in the broker itself. This is why NiFi has interactive command and control to effect immediate change, and why NiFi offers the processor API to operate on, alter, and route the data streams as they flow. It is also why NiFi provides powerful back-pressure and congestion control features. NiFi's model gives you a point of central control with distributed execution, where you can address cross-cutting concerns such as compliance checks and tracking, which you would not want to implement in every producer and consumer. There are, of course, many other aspects to discuss, but sticking to the ideas raised in the thread so far, here is a response to a few of them.
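The idea of routing logic and back-pressure living in the broker can be sketched roughly as follows. This is a toy model, not NiFi code: `Connection`, `route`, the object-count threshold and the temperature rule are all assumptions made up for the example.

```python
from collections import deque


class Connection:
    """Bounded queue between two steps; refuses new items once it is full."""

    def __init__(self, max_objects: int):
        self.max_objects = max_objects
        self.queue = deque()

    def is_full(self) -> bool:
        return len(self.queue) >= self.max_objects

    def offer(self, item) -> bool:
        """Try to enqueue; False signals back-pressure to the upstream step."""
        if self.is_full():
            return False
        self.queue.append(item)
        return True

    def poll(self):
        """Hand the oldest item to the downstream step, or None if empty."""
        return self.queue.popleft() if self.queue else None


def route(record: dict, connections: dict) -> str:
    """Routing rule held by the broker, not by the producer or consumer."""
    target = "alerts" if record.get("temp", 0) > 30 else "archive"
    return target if connections[target].offer(record) else "backpressure"


if __name__ == "__main__":
    conns = {"alerts": Connection(2), "archive": Connection(2)}
    for temp in (35, 36, 37, 20):
        print(temp, route({"temp": temp}, conns))
```

Because the rule and the queue limits sit in one place, changing where hot readings go or how much may pile up in front of a slow consumer does not require touching any producer or consumer.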
In terms of the data ingestion pattern, Kafka producers push data to the Kafka broker and Kafka consumers pull data from it. Though this is a clean and scalable model, it requires every system to accept and follow that protocol. In contrast, NiFi is not tied to one specific protocol. It supports both push and pull patterns for getting data into and out of NiFi.
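The difference between the two ingestion patterns can be illustrated with a small sketch. The `Flow` class and its `push` and `pull` methods are hypothetical and only show who initiates the transfer; they do not correspond to any NiFi or Kafka API.

```python
from typing import Callable, Iterable, List


class Flow:
    """Toy ingestion endpoint that accepts data in either direction."""

    def __init__(self) -> None:
        self.received: List[bytes] = []

    def push(self, data: bytes) -> None:
        """Push pattern: the source calls us whenever it has data."""
        self.received.append(data)

    def pull(self, fetch: Callable[[], Iterable[bytes]]) -> int:
        """Pull pattern: we call the source and take whatever is available."""
        batch = list(fetch())
        self.received.extend(batch)
        return len(batch)


if __name__ == "__main__":
    flow = Flow()
    # A device that sends data on its own schedule.
    flow.push(b"event from a device")
    # A source that has to be polled, such as a directory or a table.
    pulled = flow.pull(lambda: [b"row 1", b"row 2"])
    print(pulled, len(flow.received))
```

Supporting both directions means a source that can only emit data and a source that can only be queried can feed the same flow without either one adopting a shared client protocol.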
On the data plane, NiFi does not offer distributed data durability today as Kafka does. As Lars pointed out, the NiFi community is adding distributed durability, but its value for NiFi’s use cases will be less vital than it is for Kafka, as NiFi isn’t holding the data for the arbitrary consumer pattern that Kafka supports. If a NiFi node goes down, the data on it is delayed while the node is down. Avoiding data loss, though, is easily solved thanks to tried and true RAID or distributed block storage. NiFi’s control plane does already provide high availability: the cluster manager and even multiple nodes in a cluster can be lost while the live flow continues operating normally.
Kafka offers an impressive balance of both high throughput and low latency. But comparing performance of Kafka and NiFi is not very meaningful given that they do very different things. It would be best to discuss performance tradeoffs in the context of a particular use case.