This tutorial outlines the need for and importance of Apache Flume and Apache Sqoop. There are quite a few data stores in the market today; among them, the most popular are HDFS, Hive, HBase, MongoDB, and Cassandra.
Advantages of Distributed Data Stores
All of them are open source technologies, but the key question to think about is: why are these data stores so popular?
All of these data stores are distributed in nature, i.e. they use a cluster of machines to scale linearly.
The other advantage of these data stores is that you can use a single system for both transactional and analytical data.
But the main question here is: where does this data come from? Generally, the data comes from two kinds of sources.
The first is an application that produces data, such as user notifications, weblogs, or sales transactions, generating a large amount of data on a regular basis.
Therefore, we need a data store to hold this kind of data.
The other kind of data source is a traditional RDBMS, such as Oracle, MySQL, IBM DB2, or SQL Server.
Steps to port RDBMS data to HDFS
Let us suppose we have a requirement to port archived RDBMS data to the HDFS data store. The important thing to understand is how we get the RDBMS or application data into HDFS.
Normally, the Hadoop ecosystem exposes Java APIs to write data into HDFS and the other data stores. Therefore, we have Java APIs for HDFS, HBase, Cassandra, and so on.
We can use these APIs directly to write data to HDFS, HBase, or Cassandra, but there are problems when you are streaming data from an application or bulk-transferring data from RDBMS tables.
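As a rough sketch (the class name, output path, and record contents here are hypothetical), writing a few records with the HDFS Java FileSystem API looks something like this:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical output file for this example
        Path out = new Path("/data/notifications/events.log");

        // Create the file (overwrite if it exists) and write sample records
        try (FSDataOutputStream stream = fs.create(out, true)) {
            stream.write("notification_created,user=42\n".getBytes(StandardCharsets.UTF_8));
            stream.write("notification_clicked,user=42\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```

This is fine for an occasional bulk write, but calling it for every individual event quickly runs into the buffering and small-file problems described below.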
Let us suppose we have an application that sends “user notifications” for a professional network and then tracks metrics such as the number of views or clicks on those notifications.
A number of events produce data, for example:
- Creating a notification
- The user reading the notification
- The user clicking on the notification
This data needs to be stored as each event occurs: the moment the data is generated, it needs to be sent to the data store. This is called streaming data.
Let us suppose we want to write this data to HDFS. For this, we first need to integrate the application with the Java API of the HDFS data store.
Secondly, we need to devise a mechanism to buffer this streaming data. It is important to note that HDFS stores data as files distributed across a cluster of commodity servers in blocks of, say, 64 MB or 128 MB.
The metadata about these data blocks is written to one high-end server called the NameNode. If there are many small files, we create extra overhead on the NameNode, which must keep metadata for a larger number of files.
And if the files hold only small amounts of data, we are not taking advantage of HDFS, which is built around a cluster of commodity servers.
Now we have a challenge: how do we create large files from small amounts of streaming data? One way is to keep an HDFS file open and keep writing into the same file until it is big enough.
But this is not a good idea, because if you keep the file open for a long duration, there is a high chance that the file becomes corrupt or data is lost if a failure occurs.
Therefore, the solution is a buffer: either an in-memory buffer or an intermediate file before we write to HDFS.
It is not enough to have a buffer mechanism that accumulates a single large file; the buffer layer should also be fault tolerant and non-lossy.
Therefore, writing an application that buffers the data and integrates with the Java APIs of the various data stores is a time-consuming and costly process.
Similarly, when you want to port your RDBMS data to HDFS, the same kinds of challenges occur.
So, using the Java APIs directly is a tedious process, and hence the solution is Apache Flume and Apache Sqoop, technologies developed to isolate and abstract the transport of data between a source and a data store.
Apache Flume
Apache Flume and Apache Sqoop are open source technologies developed and maintained by the Apache Software Foundation.
Apache Flume acts as a buffer between your streaming application and the data store. It can read data from different types of sources, such as HTTP, a file directory, or syslog messages, and write data to many types of sinks, e.g. HDFS, HBase, or Cassandra.
Once we define the source and the sink, we put Apache Flume in between to transport the data between the application and the data store.
Apache Flume buffers the data based on the requirements of the source and the sink.
The source might write data at one rate while the sink reads it at a different rate.
Different sources might also produce data at different rates and in different formats. Flume can read from multiple sources at the same time, so it is able to deal with the different rates and formats.
A sink might require data to be written in a particular format at a particular rate. Apache Flume comes with built-in handlers for common sources and sinks.
Apache Flume takes care of fault tolerance and guarantees no data loss for certain configurations. Apache Flume is a push-based system, meaning it decides when the data should be written to the sink.
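To make this concrete, below is a minimal sketch of a Flume agent configuration; the agent name, directories, and HDFS path are hypothetical. It reads events from a spooling-directory source, buffers them in a durable file channel, and rolls large files into an HDFS sink:

```properties
# Hypothetical agent "agent1" with one source, one channel, and one sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: watch a local directory for new event files
agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /var/log/notifications
agent1.sources.src1.channels = ch1

# Channel: file-backed buffer that survives agent restarts (non-lossy)
agent1.channels.ch1.type          = file
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDirs      = /var/flume/data

# Sink: write to HDFS and roll files by size so HDFS gets large files
agent1.sinks.sink1.type              = hdfs
agent1.sinks.sink1.channel           = ch1
agent1.sinks.sink1.hdfs.path         = /data/notifications
agent1.sinks.sink1.hdfs.fileType     = DataStream
agent1.sinks.sink1.hdfs.rollSize     = 134217728
agent1.sinks.sink1.hdfs.rollCount    = 0
agent1.sinks.sink1.hdfs.rollInterval = 0
```

The agent is then started with `flume-ng agent --conf-file <config file> --name agent1`, and the application never touches the HDFS API directly; the buffering and file rolling are handled by the channel and sink.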
Apache Sqoop
In a pull-based system, the sink subscribes to the system that produces the data and pulls the data at a regular frequency.
Whenever you add multiple sinks, the Flume configuration must change, whereas in a pull-based system the configuration of the producing system does not change; a newly added sink simply starts subscribing. Apache Kafka is an example of a pull-based system.
Apache Sqoop is a command-line tool that can import data directly from an RDBMS into the Hadoop layer.
Sqoop is a pull-based system used for bulk imports, not for streaming data.
Sqoop comes with connectors for many popular RDBMSs. With Sqoop, you can import entire tables from an RDBMS or the results of specific SQL queries. You can also schedule periodic imports using Sqoop jobs.
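For example, a bulk import of a single archived table into HDFS might look like the following sketch; the JDBC URL, credentials, table name, and target directory are all hypothetical:

```bash
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username report_user \
  --password-file /user/etl/db.password \
  --table archived_orders \
  --target-dir /data/archived_orders \
  --num-mappers 4
```

For periodic imports, a command like this can be saved as a Sqoop job and run in Sqoop's incremental mode so that only new rows are pulled on each run.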