Key-Value Database – NoSQL Key Value, Application and Examples

A key-value database is essentially a big, distributed hash table of keys and values spread across a cluster of commodity servers. Key-value databases typically guarantee Availability and Partition Tolerance.

A key-value database trades off consistency in order to improve write performance.


The key in a key-value database can be synthetic or auto-generated, and it uniquely identifies a single record in the database. The value can be a string, JSON, a BLOB, etc.

Among the most popular key-value databases are Amazon DynamoDB, Oracle NoSQL Database, Riak, Berkeley DB, Aerospike, Project Voldemort, and IBM Informix C-ISAM.
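As an illustration, the key-to-value model can be sketched with an ordinary Python dict standing in for the distributed hash table. The `put`/`get` helpers and the `user:1001` key are invented for this example; they are not the API of any real store.

```python
import json

# A minimal sketch of the key-value model: a plain dict plays the role
# of the "big hash table". Real stores such as DynamoDB or Riak
# distribute this table across many commodity servers.
store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

# A synthetic key pointing at a JSON value:
put("user:1001", json.dumps({"name": "Alice", "lang": "en"}))
# A key pointing at raw bytes (a BLOB):
put("avatar:1001", b"\x89PNG...")

profile = json.loads(get("user:1001"))
print(profile["name"])  # Alice
```

Note that the store itself is schemaless: one key maps to JSON text, another to binary data, and the store does not care.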

Applications of Key-Value Databases – NoSQL Key Value

Let us take some real-life examples where key-value databases are used and look at the benefits they provide.

Managing Web Advertisements

Key-value databases are widely used by web advertisement companies.

A user’s activity is tracked across the web, along with their language and location. On the basis of this online activity, web advertisement companies decide which advertisement to show to the user.

It is also important to note that serving an advertisement must be fast.

It is important to target the right advertisement to the right customer in order to receive more clicks and hence maximize profits.

The combination of factors that determines what a user is interested in, such as the user’s tracked online activity, language, and location, forms the key, while all the other data needed to serve the advertisement is kept as the value in the key-value database.

User Session Data Retrieval

Your website needs to be efficient and fast to give users the best service.

No matter how efficient your database is, if your website runs slowly then, from a user’s perspective, your entire service is slow.

Websites often run slowly because user sessions are handled poorly. If every request requires opening a new session instead of reading cached session information, the website will be slow.

User interactions with the website are tracked using cookies.

A cookie is a small file containing a unique id that can act as a key in a key-value database. The server uses cookies to distinguish returning users from new users.

The server needs to fetch session data quickly by doing a lookup on the cookie id. This data tells it which pages the user visits, what information they are looking for, the user’s profile, and so on.

Key-value stores are therefore ideal for storing and retrieving session data at high speed. The unique id from the cookie acts as the key, while the other information, such as the user profile, acts as the value.
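A minimal sketch of this session pattern, assuming an in-memory dict as the session store; the helper names (`new_session`, `on_request`) are invented for illustration:

```python
import uuid

# Cookie id -> session data. In production this dict would be a
# key-value store shared by all web servers, not process memory.
sessions = {}

def new_session():
    cookie_id = str(uuid.uuid4())   # the unique id stored in the cookie
    sessions[cookie_id] = {"profile": {}, "pages_visited": []}
    return cookie_id

def on_request(cookie_id, page):
    # Returning user: one O(1) lookup instead of rebuilding the session.
    session = sessions.get(cookie_id)
    if session is None:             # new user or expired cookie
        cookie_id = new_session()
        session = sessions[cookie_id]
    session["pages_visited"].append(page)
    return cookie_id, session

cid, s = on_request(None, "/home")      # first visit: session created
cid, s = on_request(cid, "/products")   # return visit: fast lookup
print(s["pages_visited"])  # ['/home', '/products']
```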

NoSQL Database Types – Introduction, Example, Comparison and List

In this post, you will learn about NoSQL database types and the basic features of each. NoSQL databases can be broadly categorized into four categories.

  1. Key-Value databases
  2. Document Databases
  3. Column family NoSQL Database
  4. Graph Databases

NoSQL Database Types Introduction

Let’s go through a short introduction to each of these NoSQL database types and understand their features. NoSQL databases are widely used in Big Data and provide operational intelligence to users.

Key-Value databases

A key-value database is essentially a big, distributed hash table of keys and values spread across a cluster of commodity servers. Key-value databases typically guarantee Availability and Partition Tolerance.

Key-value databases trade off consistency in order to improve write performance.

The key can be synthetic or auto-generated, and it uniquely identifies a single record in the database. The value can be a string, JSON, a BLOB, etc.

Among the most popular key-value databases are Amazon DynamoDB, Oracle NoSQL Database, Riak, Berkeley DB, Aerospike, Project Voldemort, IBM Informix C-ISAM.

Document Databases

The main concept behind a document database is the document, which can be JSON, BSON, XML, and so on. Document databases store and retrieve documents.

The data structure inside a document database is hierarchical in nature and can be a scalar value, a map, or a collection. It is similar to a key-value database, the main difference being that a document database stores data in the form of a document that embeds attribute metadata alongside the stored content.

Every document database uses its own structure to store data. For example, Apache CouchDB uses JSON to store data, JavaScript as its query language, and the HTTP protocol for its API.

Among the most popular document databases are MongoDB, Informix, DocumentDB, CouchDB, and BaseX.
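To make the "embedded attribute metadata" point concrete, here is a sketch in plain Python: each record is a self-describing document, so we can query by any field without a predefined schema. The collection, field names, and `find` helper are all invented for this example.

```python
# Documents carry their attribute names alongside their values,
# unlike an opaque key-value blob.
books = [
    {"_id": "b1", "title": "Pig in Action", "authors": ["Alice"],
     "tags": ["hadoop", "pig"]},
    {"_id": "b2", "title": "Graphs 101", "authors": ["Bob", "Carol"],
     "tags": ["graphs"]},
]

def find(docs, **criteria):
    # Match a criterion either by equality or by list membership,
    # so tags="pig" matches a document whose "tags" list contains "pig".
    return [d for d in docs
            if all(d.get(k) == v
                   or (isinstance(d.get(k), list) and v in d[k])
                   for k, v in criteria.items())]

print([d["_id"] for d in find(books, tags="pig")])       # ['b1']
print([d["_id"] for d in find(books, title="Graphs 101")])  # ['b2']
```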

Column family NoSQL Database

The column family NoSQL database is another aggregate-oriented database.

In a column family database we have a single key, also known as the row key, and under it we can store multiple column families, where each column family is a set of columns that fit together. The column family as a whole is effectively your aggregate, and we use the row key and the column family name to address it.
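The addressing scheme can be sketched as nested maps: row key, then column family name, then columns. The table layout and the `get_family` helper are invented for illustration; HBase and Cassandra organize data along these lines but with their own APIs.

```python
# row key -> column family name -> columns. The whole column family
# is the aggregate, fetched in one go.
table = {
    "order:1001": {                        # row key
        "summary": {                       # column family: columns that fit together
            "customer": "alice", "total": "99.50"},
        "lines": {                         # another column family
            "sku-1": "2", "sku-7": "1"},
    }
}

def get_family(row_key, family):
    # Address an aggregate with (row key, column family name).
    return table[row_key][family]

print(get_family("order:1001", "summary"))
```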

It is one of the more complicated aggregate-oriented databases, but the gain is in the retrieval time of aggregate rows: when we load these aggregates into memory, we fetch the whole thing from the database in one go instead of gathering a lot of individual records spread across it.

The database is designed in such a way that it clearly knows what the aggregate boundaries are. This is very useful when we run the database on a cluster.

Since an aggregate binds related data together, different aggregates are spread across different nodes in the cluster.

Therefore, if somebody wants to retrieve data, say about a particular order, they need to go to only one node in the cluster instead of querying all the other nodes to pick up different rows and combine them.

Among the most popular column family NoSQL databases are Apache HBase and Cassandra.

Graph Databases

Graph databases store data in the form of a graph.

Let us first understand what a graph is. A graph is a mathematical model used to establish relations between objects.

We will discuss the concept of a graph database taking Neo4j as the base database.

Neo4j is an open-source NoSQL graph database implemented in Java and Scala. Its source code is available on GitHub, and it is used by companies such as Walmart, eBay, and LinkedIn.
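The property-graph model that Neo4j implements can be sketched in plain Python as nodes with properties plus named relationships between them. The node ids, labels, and the `neighbours` helper are invented for this example; Neo4j itself is queried with Cypher, not Python dicts.

```python
# Nodes with labels and properties:
nodes = {
    1: {"label": "Person", "name": "Alice"},
    2: {"label": "Person", "name": "Bob"},
    3: {"label": "Company", "name": "Acme"},
}
# Each edge is (from_node, relationship_name, to_node):
edges = [(1, "KNOWS", 2), (1, "WORKS_AT", 3), (2, "WORKS_AT", 3)]

def neighbours(node_id, rel):
    # Follow all outgoing edges of the given relationship type.
    return [nodes[t]["name"] for (f, r, t) in edges
            if f == node_id and r == rel]

print(neighbours(1, "KNOWS"))     # ['Bob']
print(neighbours(1, "WORKS_AT"))  # ['Acme']
```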

CAP Theorem – Brewer’s Theorem | Hadoop HBase

In this post, we will learn about the CAP theorem, or Brewer’s theorem. The theorem was proposed by Eric Brewer of the University of California, Berkeley.

CAP Theorem or Brewer’s Theorem

The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computing system to simultaneously provide all three of the following guarantees: Consistency, Availability, and Partition tolerance.

Therefore, at any point in time, a distributed system can provide only two of consistency, availability, and partition tolerance.

Availability

Even if one of the nodes goes down, we can still access the data.

Consistency

Every read returns the most recent data.

Partition Tolerance

The system keeps working even if there is a network outage between the nodes.

These three guarantees are often drawn as the three vertices of a triangle, and we are free to choose any one side of the triangle.

Therefore, we can choose (Availability and Consistency) or (Availability and Partition Tolerance) or (Consistency and Partition Tolerance).

(Figure: the CAP theorem triangle, with Consistency, Availability, and Partition Tolerance at its vertices.)

Relational databases such as Oracle and MySQL choose Availability and Consistency, while databases such as Cassandra, CouchDB, and DynamoDB choose Availability and Partition Tolerance, and databases such as HBase and MongoDB choose Consistency and Partition Tolerance.

CAP Theorem Example 1:  Consistency and Partition Tolerance

Let us take an example to understand one of the use cases say (Consistency and Partition Tolerance).

These databases are usually sharded or distributed, and they tend to have a master or primary node through which they handle write requests. A good example is MongoDB.

What happens when the master goes down?

In this case, usually another master gets elected, and until then the data can’t be read from the other nodes, as it is not consistent. Therefore, availability is sacrificed.

However, if the write operation completed and there is a network outage between the nodes, there is no problem, because a secondary node can serve the data. Therefore, partition tolerance is achieved.

CAP Theorem Example 2: Availability and Partition Tolerance

Let us try to understand an example for Availability and Partition Tolerance.

These databases are also sharded and distributed in nature, and usually masterless, meaning every node is equal. Cassandra is a good example of this kind of database.

Let us consider an overnight batch job that writes data from a mainframe to a Cassandra database, and the same database is read throughout the day. If we read the data as and when it is written, we might get stale data, and hence consistency is sacrificed.

Since this is a read-heavy, write-once use case, we don’t care about reading the data immediately; we just care that once the write has happened, we can read it from any of the nodes.

Availability, however, is one of the important parameters, because if one of the nodes goes down we can still read the data from another replica node. The system as a whole stays available.

Partition tolerance helps us in the case of any network outage between the nodes: if one node becomes unreachable due to a network issue, another node can take over.
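The stale-read trade-off above can be shown with a deliberately simplified simulation: a write lands on one replica, other replicas keep answering reads (availability) but may return old data until a sync runs. The three-replica list and the `write`/`read`/`replicate` helpers are invented for this sketch, not Cassandra’s real replication protocol.

```python
import copy

# Three replicas of the same record, all starting in sync.
replicas = [{"price": 10}, {"price": 10}, {"price": 10}]

def write(node, key, value):
    replicas[node][key] = value      # the write lands on one node first

def read(node, key):
    return replicas[node][key]       # any node answers: system stays available

def replicate():
    # The "overnight sync": copy the latest state to every replica.
    latest = copy.deepcopy(replicas[0])
    for i in range(len(replicas)):
        replicas[i] = copy.deepcopy(latest)

write(0, "price", 12)
print(read(1, "price"))  # 10 -- stale read: consistency sacrificed
replicate()
print(read(1, "price"))  # 12 -- eventually consistent
```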

NoSQL databases – Introduction, features, NoSQL vs SQL

NoSQL databases are non-relational database management systems that are cluster-friendly and designed for large volumes of distributed data.

The relational model is the de facto standard for database design; it uses primary and foreign key relationships to store and manipulate data.

However, with the growing volume of unstructured data in distributed computing environments, the relational model does not fit well. Relational models were not built to take advantage of the commodity storage and processing power available today.

As data volume grows, it becomes difficult to store the data on the single-node systems that the relational model adheres to. This gave birth to commodity storage, where a large cluster of commodity machines interacts in a distributed fashion.

SQL (Structured Query Language) is designed to work with single-node systems and does not work very well with large storage clusters. Therefore, top internet companies such as Google, Facebook, and Amazon started looking for solutions to overcome the drawbacks of the RDBMS.

This inspired the whole new movement of databases which is the “NoSQL” movement.

NoSQL databases do not require a fixed schema, and they typically scale horizontally, i.e. extra commodity machines are added to the resource pool so that load can be distributed easily.

Sometimes we create data with several levels of nesting, which is highly complicated to understand, for example geo-spatial or molecular modeling data.

Big Data NoSQL databases ease the representation of such nested or multi-level hierarchical data using the JSON (JavaScript Object Notation) format.

NoSQL Databases Features

Let’s go through some of the key NoSQL database features and how they differ from those of traditional databases.

Schemaless databases

Traditional databases require a pre-defined schema.

A database schema is basically the structure that describes how the data is organized, the relations among database objects, and the constraints applied to the data.

In a NoSQL database, by contrast, there is no need to define the structure. This gives us the flexibility to store information without doing upfront schema design.

Therefore, a user of a NoSQL database can store data of different structures in the same database table. That is why these databases are also sometimes referred to as “schema on read” databases.

That means a schema is applied to the data only when it is read, or pulled out of its stored location.
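"Schema on read" can be sketched as follows: records of different shapes live in the same collection, and the structure is imposed only by the reader. The `events` records and the `read_as_payment` function are invented names for this illustration; no real library is assumed.

```python
# Two differently shaped records stored side by side -- a traditional
# table would reject one of them, a schemaless store does not care.
events = [
    {"user": "alice", "page": "/home"},                   # a web click
    {"user": "bob", "amount": 42.0, "currency": "USD"},   # a payment
]

def read_as_payment(record):
    # The "schema" lives in the reader, not the store: missing fields
    # get defaults instead of violating a table definition.
    return {
        "user": record.get("user", "unknown"),
        "amount": float(record.get("amount", 0.0)),
        "currency": record.get("currency", "N/A"),
    }

print(read_as_payment(events[0]))  # click record: amount defaults to 0.0
print(read_as_payment(events[1]))  # payment record: fields pass through
```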

Non-Relational

NoSQL databases are non-relational in nature and can store any type of content.

There is no need to apply data modeling techniques such as ER modeling, star modeling, etc.

A single record can accommodate transaction details as well as attribute details such as address, account, and cost center.

Non-relational data does not fit into rows and columns, and these databases are mainly designed to take care of unstructured data.

Distributed Computing

You can scale your system horizontally by taking advantage of low-end commodity servers.

Distribution of processing load and scaling of data sets are common features of many NoSQL databases.

Data is automatically distributed over the cluster of commodity servers, and if you need to improve scalability further, you can keep adding commodity servers to the cluster.

Aggregate Data Models

The aggregate data model treats related data as a single unit, which makes it easier to manage data over a cluster.

When a unit of data is retrieved from a NoSQL database, it brings all its related data along with it.

Let us say we need to find a Product by Category. In the relational model, we use normalization and create two tables, Product and Category. Whenever we need to retrieve the details of a Product by Category, we perform a join operation.

In a NoSQL database, by contrast, we create one document that holds the product as well as its category information:

Product =

{
    "sku": 321342,
    "name": "book",
    "price": 50.00,
    "subject": "mathematics",
    "item_in_stocks": 5000,
    "category": [ { "id": 1, "name": "math5" }, { "id": 2, "name": "math6" } ]
}

Pig Tutorial – Hadoop Pig Introduction, Pig Latin, Use Cases, Examples

In this series, we will cover a Pig tutorial. Apache Pig provides a platform for processing large data sets in a distributed fashion on a cluster of commodity machines.

Pig tutorial – Pig Latin Introduction

The language used to express these data transformations is called Pig Latin. It allows you to perform transformations such as joining, sorting, filtering, and grouping of records.

It is a sort of ETL process for the Big Data environment. It also lets users create their own functions for reading, processing, and writing data.

Pig is an open-source project developed under the Apache Software Foundation (http://pig.apache.org), so users are free to download it as source or binary code.

Pig Latin programs run on a Hadoop cluster and make use of both the Hadoop Distributed File System and the MapReduce programming layer.

However, for prototyping, Pig Latin programs can also run in “local mode” without a cluster; in that case, all the processes invoked while running the program reside in a single local JVM.

Using Pig Latin, programmers can perform MapReduce tasks easily without having to write complex Java code. The map, sort, shuffle, and reduce phases are taken care of internally by the operators and functions you use in the Pig script.

Basic “hello world program” using Apache Pig

The basic “hello world” program in Hadoop is the word count program.

The same example is explained in the “Hadoop and HDFS” tutorial using a Java MapReduce program.

Now, let’s look at the Pig Latin version. Let us consider our input to be a text file with words delimited by spaces and lines terminated by ‘\n’, stored in a file named “src.txt”. Sample data of the file is as below:

Old MacDonald had a farm
And on his farm he had some chicks
With a chick chick here
And a chick chick there
Here a chick there a chick
Everywhere a chick chick

The word count program for the above sample data using Pig Latin is as below:

A = load 'src.txt' as (line:chararray);

-- TOKENIZE splits the line into a field for each word, e.g. ({(Old),(MacDonald),(had),(a),(farm)})
B = foreach A generate TOKENIZE(line) as tokens;

-- FLATTEN converts the output of TOKENIZE into separate records, e.g. (Old), (MacDonald), (had), (a), (farm)
C = foreach B generate FLATTEN(tokens) as words;

-- To count each word's occurrences, group all identical words together.
D = group C by words;
E = foreach D generate group, COUNT(C);
F = order E by $1;

-- Print the word counts on the console using dump.
dump F;

Partial sample output will be as follows:

(MacDonald,1)
(had,1)

Pig Use Cases – Examples

Pig Use Case#1

Web server logs can be processed using Pig because they are a goldmine of information. Using this information we can analyze overall server usage and improve server performance.

We can create usage-tracking mechanisms, such as monitoring users and processes and pre-empting security attacks on the server.

We can also analyze frequent errors and take corrective measures to enhance the user experience.

Pig Use Case#2

Knowing the effectiveness of an advertisement is an important goal for any company.

Many companies invest millions of dollars in buying ad space, and it is critical for them to know how popular their advertisements are, both in physical and virtual space.

Gathering advertising data from multiple sources and analyzing it to understand customer behavior and advertisement effectiveness is an important goal for many companies, and it can be achieved easily using Pig Latin.

Pig Use Case#3

Processing healthcare information is one of the important use cases of Pig.

A Neural Network for Breast Cancer Data built on Google App Engine is one notable application developed using Pig and Hadoop.

Pig Use Case#4

Stock analysis using Pig Latin, such as calculating average dividends, estimating total trades, and grouping and joining stocks.