Apache Flume is a data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as events and log files, from many sources to a centralized data store. It is most commonly used to copy log data or streaming data from various web servers into HDFS.
Apache Flume supports several sources as follows:
- ‘Tail’: Data is piped from local files and written into HDFS via Flume, similar in spirit to the Unix `tail` command.
- System logs
- Apache log4j: enables Java applications to write events to files in HDFS via Flume
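As a sketch of how a tail-style source is wired up, the following Flume agent configuration uses the built-in `exec` source to pipe the output of `tail -F` into a channel. The agent name `a1`, the channel name `c1`, and the log path are illustrative assumptions, not fixed values:

```properties
# Hypothetical agent 'a1': tail a local log file with the exec source
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1
```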
Features of Flume
Before going further, let’s look at the features of Flume:
- Flume efficiently ingests log data from multiple web servers into HDFS and HBase. It can also collect huge volumes of event data from sources such as social-networking sites.
- With Flume, data can be pulled from multiple servers into Hadoop in near real time.
- Flume supports a large variety of source and destination types.
- Flume has a flexible design based on streaming data flows; the design is robust and fault-tolerant, with multiple recovery mechanisms.
- Flume carries data between sources and sinks; this data transfer can be either scheduled or event-driven.
Refer to the image below to better understand the Flume architecture. Data generators produce data, which is collected by Flume agents running close to them. A data collector (itself an agent) gathers data from these agents, aggregates it, and pushes it to a centralized store, i.e., HDFS.
Let’s now talk about each of the components present in the Flume architecture:
- Flume Events
The basic unit of data transported inside Flume is called an event. An event carries a byte-array payload, optionally accompanied by headers, and is transported from the source to the destination.
- Flume Agents
In Apache Flume, an agent is an independent daemon process (JVM). It receives events from clients or other agents and forwards them to their next destination, which can be a sink or another agent. A Flume deployment can have more than one agent.
- Flume Source: A Flume source receives data from data generators and transfers it to one or more channels as Flume events. Apache Flume supports several types of sources, and each source receives events from a specified data generator.
- Flume Channel: A Flume channel is a transient store that receives events from the source and buffers them until they are consumed by sinks. In effect, it acts as a bridge between the sources and the sinks in Flume. Channels can work with any number of sources and sinks, since they are fully transactional.
- Flume Sink: The Flume sink component stores data into centralized stores like HBase and HDFS. It consumes events from the channels and delivers them to the destination, which might be another agent or a central store.
- Flume Clients
Entities that generate events and send them to one or more agents are called Flume clients.
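Putting these components together, a Flume agent is declared in a properties file that names its source, channel, and sink and binds them to each other. The sketch below is a minimal, illustrative single-agent pipeline; the agent name `a1`, the port, and the HDFS path are assumptions for the example:

```properties
# Illustrative single-agent pipeline: netcat source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for newline-separated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: transient in-memory buffer between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: writes events to the centralized store (HDFS)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such an agent is typically started with `flume-ng agent --conf-file <file> --name a1`.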
Now, that we have seen in-depth the architecture of Flume, let’s look at the advantages of Flume as well.
Advantages of Flume
Here are the advantages of using Flume:
- Using Apache Flume we can store the data into any of the centralized stores (HBase, HDFS).
- When the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between data producers and the centralized stores and provides a steady flow of data between them.
- Flume provides the feature of contextual routing.
- The transactions in Flume are channel-based: two transactions (one sender and one receiver) are maintained for each message, which guarantees reliable message delivery.
- Flume is reliable, fault-tolerant, scalable, manageable, and customizable.
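The mediator role described above can be sketched in plain Python (this is not Flume code): a bounded queue stands in for the channel, absorbing a burst from a fast producer so that a slower sink sees a steady, bounded flow. All names and sizes here are illustrative.

```python
from collections import deque

class Channel:
    """Toy stand-in for a Flume channel: a bounded FIFO buffer."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()

    def put(self, event):
        if len(self.buffer) >= self.capacity:
            return False  # channel full: producer must back off and retry
        self.buffer.append(event)
        return True

    def take(self):
        # Sink drains events at its own pace
        return self.buffer.popleft() if self.buffer else None

channel = Channel(capacity=5)

# Bursty producer: 8 events arrive at once, but only 5 fit in the channel
accepted = [channel.put(f"event-{i}") for i in range(8)]

# Slow sink drains the channel afterwards
drained = []
while (event := channel.take()) is not None:
    drained.append(event)
```

Real Flume channels add transactions and durability options on top of this buffering idea.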
This brings us to the end of the Flume section. We learned about Apache Flume in depth and examined its architecture.
Apache ZooKeeper
Apache ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers.
- The ZooKeeper service is replicated over a set of machines.
- Every machine keeps a copy of the data in memory.
- A leader is elected at service startup.
- A client connects to a single ZooKeeper server and maintains a persistent TCP connection.
- Clients can read from any ZooKeeper server, while writes go through the leader and require majority consensus.
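The read/write split above can be illustrated with a small simulation (plain Python, not the ZooKeeper API): any replica can serve a read, but a write is committed only once a majority of replicas acknowledge it. The ensemble size, znode path, and values are illustrative.

```python
class Ensemble:
    """Toy model of a replicated ZooKeeper-style service."""
    def __init__(self, size):
        self.size = size
        self.replicas = [dict() for _ in range(size)]  # each machine keeps a data copy

    def write(self, key, value, acks):
        # Writes go through the leader and need majority consensus
        if acks <= self.size // 2:
            return False  # no quorum: write is rejected
        for replica in self.replicas:
            replica[key] = value  # replicate to every machine
        return True

    def read(self, key, replica_index):
        # Any single server can serve a read from its local copy
        return self.replicas[replica_index].get(key)

ensemble = Ensemble(size=5)
ok = ensemble.write("/config/mode", "active", acks=3)         # 3 of 5: quorum reached
rejected = ensemble.write("/config/mode", "standby", acks=2)  # 2 of 5: no quorum
value = ensemble.read("/config/mode", replica_index=4)
```

The real protocol (ZAB) also handles leader failure and ordering guarantees, which this sketch omits.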
Hue
Hue is an open-source, web-based platform for analyzing data with Hadoop and Spark. It consists of a suite of applications for executing queries, copying files, and building workflows.
Features of Hue
Hue offers the following features:
- Spark Notebooks
- Wizards to import data into Hadoop
- Dynamic search dashboards for Solr
- Browsers for YARN, HDFS, the Hive Metastore, HBase, and ZooKeeper
- SQL editors for Impala, Hive, MySQL, SQLite, PostgreSQL, and Oracle
- Pig editor, Sqoop2, and Oozie workflow editors and dashboards