Flume installation is a very simple process. Go through the following steps to install and configure Apache Flume.
Steps for Flume Installation
- Download Apache Flume from the Apache download page. (Apache Flume Download URL)
- Extract the downloaded archive and point to the Flume folder in your bash profile. The bash profile entry makes sure that you can start the Flume agent from any directory. For example: export FLUME_HOME=$HOME/apache-flume-1.6.0-bin
- Append $FLUME_HOME/bin to the PATH variable, for example: export PATH=$PATH:$FLUME_HOME/bin. The full sequence is sketched below.
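Putting the steps together, a minimal sketch for a Linux or macOS machine with bash might look like the following. The version number (1.6.0), the download mirror, and the install location under $HOME are assumptions; adjust them to your environment.

```bash
cd $HOME

# Download the binary distribution from an Apache mirror/archive
# (URL shown for version 1.6.0; pick the mirror and version you need).
wget https://archive.apache.org/dist/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz

# Extract the archive; this creates the apache-flume-1.6.0-bin directory.
tar -xzf apache-flume-1.6.0-bin.tar.gz

# Point FLUME_HOME at the extracted folder and add its bin/ directory to PATH
# so the flume-ng command works from any directory.
echo 'export FLUME_HOME=$HOME/apache-flume-1.6.0-bin' >> ~/.bash_profile
echo 'export PATH=$PATH:$FLUME_HOME/bin' >> ~/.bash_profile
source ~/.bash_profile

# Verify the installation.
flume-ng version
```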
Flume Agent
A Flume agent is an independent Java daemon (JVM process) that receives events (data) from an external source and forwards them to the next destination. The next destination can be another agent or a sink. A Flume agent can connect any number of sources to any number of data stores.
Let’s understand this with an example:
Suppose you have two data sources, say DS1 and DS2, and you want to write DS1 data into HDFS and DS2 data into Cassandra.
For this scenario, one Flume agent is enough to complete the job. A Flume agent operates hop by hop, i.e. writing the data of DS1 and DS2 to HDFS and Cassandra completes one hop.
Now suppose the data has another hop, i.e. the data written to HDFS is read by some other application and finally needs to go to another datastore, say Hive.
Here, two Flume agents are required, since there are two hops: one agent moves DS1 and DS2 data to HDFS and Cassandra respectively, and the other moves data from HDFS to Hive. A configuration sketch for the first hop is shown below.
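As a rough sketch of the first hop, a single agent (here named agent1, with component names, spool directories, and the HDFS path invented for illustration) can run both pipelines side by side. Note that Flume does not ship a Cassandra sink, so the Cassandra sink class below stands in for whatever custom or third-party sink you actually use:

```properties
agent1.sources  = ds1 ds2
agent1.channels = ch1 ch2
agent1.sinks    = hdfsSink cassandraSink

# DS1: read files dropped into a spool directory (path is illustrative).
agent1.sources.ds1.type     = spooldir
agent1.sources.ds1.spoolDir = /data/ds1
agent1.sources.ds1.channels = ch1

# DS2: a second, independent pipeline within the same agent.
agent1.sources.ds2.type     = spooldir
agent1.sources.ds2.spoolDir = /data/ds2
agent1.sources.ds2.channels = ch2

agent1.channels.ch1.type = memory
agent1.channels.ch2.type = memory

# DS1 data goes to HDFS through the built-in HDFS sink.
agent1.sinks.hdfsSink.type      = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/ds1
agent1.sinks.hdfsSink.channel   = ch1

# DS2 data goes to Cassandra through a custom/third-party sink
# (the fully qualified class name here is hypothetical).
agent1.sinks.cassandraSink.type    = com.example.flume.CassandraSink
agent1.sinks.cassandraSink.channel = ch2
```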
There are three basic components of a Flume agent:
- The first component receives data. This component is called the source.
- The second component buffers data. This component is called the channel.
- The third component writes data. This component is called the sink.
A Flume agent is set up using a configuration file, and within that configuration file we define the sources, channels, and sinks, along with the format in which data is sent to each source.
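For instance, a minimal configuration that wires one of each component together might look like the following; the agent name a1, the listening port, and the component names are assumptions:

```properties
# a1 is the agent name passed to flume-ng with --name.
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: receive newline-terminated text over a TCP socket.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between the source and the sink.
a1.channels.c1.type = memory

# Sink: write events to the agent's log, which is handy for testing.
a1.sinks.k1.type = logger

# Wire the components together.
a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
```

Assuming this is saved as example.conf, the agent can be started with something like: flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console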
It is important to configure the channel capacity according to the rates of the source and the sink. There are many types of channels, but two are most commonly used.
The first is the memory channel, which buffers data in the memory of the system where the Flume agent is running. The memory channel basically acts like an in-memory queue.
The source will write to the tail of the queue and the sink will read from the head of the queue.
But there are issues with the memory channel. The primary issue is that its capacity is constrained by the amount of memory the system has, and the memory channel is not persistent.
Therefore, in case of a crash, all the data present in the buffer might be lost. File channels are better in this respect because they give you fault tolerance and non-lossy buffering, i.e. a guarantee of no data loss.
Since a file channel buffers data on disk, you can configure a larger buffer capacity as per your requirements. Channels are continuously polled by sink components, which write the data to the endpoints. Both channel types are sketched below.
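As an illustration, the two channel types differ mainly in their type and tuning properties; the capacities and directories below are placeholder values, not recommendations:

```properties
# Memory channel: fast, but bounded by RAM and lost if the agent crashes.
a1.channels.mem.type = memory
# Maximum number of events the channel can hold.
a1.channels.mem.capacity = 10000
# Maximum number of events per source/sink transaction.
a1.channels.mem.transactionCapacity = 1000

# File channel: buffered on disk, so it survives agent restarts and can be
# sized much larger than the available memory.
a1.channels.disk.type          = file
a1.channels.disk.checkpointDir = /var/flume/checkpoint
a1.channels.disk.dataDirs      = /var/flume/data
a1.channels.disk.capacity      = 1000000
```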
Multiple sources can write to a single channel, and one source can write to multiple channels, i.e. there is a many-to-many relationship between sources and channels.
However, the channel-to-sink relationship is one-to-one. The channel does not delete data immediately as it is written to the sink; it waits for an acknowledgment from the sink before deleting that data from the channel.
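These fan-out and one-to-one rules show up directly in the wiring of a configuration file. In the sketch below (agent name, port, and directory are illustrative), one source replicates its events into two channels, and each sink drains exactly one of them:

```properties
a1.sources  = src
a1.channels = c1 c2
a1.sinks    = k1 k2

# Listing two channels on the source fans each event out to both of them;
# the default channel selector is "replicating".
a1.sources.src.type     = netcat
a1.sources.src.bind     = localhost
a1.sources.src.port     = 44444
a1.sources.src.channels = c1 c2

a1.channels.c1.type = memory
a1.channels.c2.type = memory

# Each sink is bound to exactly one channel. An event is removed from the
# channel only after the sink's transaction commits (the acknowledgment
# described above).
a1.sinks.k1.type    = logger
a1.sinks.k1.channel = c1

a1.sinks.k2.type           = file_roll
a1.sinks.k2.sink.directory = /var/flume/out
a1.sinks.k2.channel        = c2
```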