MongoDB

MongoDB is an open-source document-based database management tool that stores data in JSON-like formats. It is a highly scalable, flexible, and distributed NoSQL database.

With the rise of data all around the world, there has been an observable and increasing interest in the wave of non-relational databases, also known as ‘NoSQL’. Businesses and organisations are seeking new methods to manage the flood of data and are drawn toward alternate database management tools and systems that differ from traditional relational database systems. This is where MongoDB comes into the picture.

Being a NoSQL tool means that it does not use the usual rows and columns that you so much associate with relational database management. It is an architecture built on collections and documents. The basic unit of data in this database is a set of key–value pairs, and documents are allowed to have different fields and structures. This database uses a document storage format called BSON, which is a binary representation of JSON documents.

The data model that MongoDB follows is a highly elastic one that lets you combine and store data of multivariate types without having to compromise on powerful indexing options, data access, and validation rules. There is no downtime when you want to dynamically modify the schemas. This means you can concentrate more on making your data work harder rather than spending more time on preparing the data for the database.
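To make this concrete, here is a minimal mongo shell sketch of a single JSON-like document; the users collection and all of its fields are hypothetical, not taken from any particular application:

db.users.insertOne({
    name: "Ada",                                // string field
    age: 36,                                    // numeric field
    address: { city: "London", zip: "N1" },     // embedded document
    interests: [ "math", "computing" ]          // array field
});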

Architecture of MongoDB NoSQL Database

Database: In simple words, it can be called the physical container for data. Each of the databases has its own set of files on the file system with multiple databases existing on a single MongoDB server.

Collection: A group of database documents can be called a collection. The RDBMS equivalent to a collection is a table. The entire collection exists within a single database. There are no schemas when it comes to collections. Inside the collection, various documents can have varied fields, but mostly the documents within a collection are meant for the same purpose or for serving the same end goal.

Document: A set of key–value pairs can be designated as a document. Documents are associated with dynamic schemas. The benefit of having dynamic schemas is that a document in a single collection does not have to possess the same structure or fields. Also, the common fields in a collection’s document can have varied types of data.
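As a quick, hedged illustration of dynamic schemas (the people collection and its fields are hypothetical), two documents with different structures can live in the same collection:

db.people.insertOne({ name: "Alice", age: 30 });
db.people.insertOne({ name: "Bob", email: "bob@example.com", tags: [ "admin" ] });
// Both inserts succeed: no schema forces the documents to share fields.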

What makes it different from RDBMS?

You can directly compare the MongoDB NoSQL with the RDBMS and map the varied terminologies in the two systems: The RDBMS table is a MongoDB collection, the column is a field, the tuple/row is a document, and the table join is an embedded document. The typical schema of a relational database shows the number of tables and the relationship between the tables, but MongoDB does not follow the concept of relationship.
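For instance, where an RDBMS would join an orders table to an order_items table, MongoDB typically embeds the related rows inside the parent document. A minimal sketch with a hypothetical orders collection:

db.orders.insertOne({
    orderId: 1001,
    customer: "Acme Corp",
    items: [                                  // embedded documents replace the table join
        { sku: "A1", qty: 2, price: 9.99 },
        { sku: "B7", qty: 1, price: 4.50 }
    ]
});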

Go through the following table to understand how exactly an expert NoSQL database like MongoDB differs from RDBMS.

MongoDB | RDBMS
Document-oriented and non-relational database | Relational database
Document based | Row based
Field based | Column based
Collection based, storing key–value pairs | Table based
Provides a JavaScript client for querying | Does not provide a JavaScript client for querying
Relatively easy to set up | Comparatively harder to set up
Not susceptible to SQL injection | Quite vulnerable to SQL injection
Dynamic schema, ideal for hierarchical data storage | Predefined schema, not good for hierarchical data storage
Often much faster for large workloads and horizontally scalable through sharding | Scales vertically, typically by increasing RAM

Important Features of MongoDB

  • Queries: It supports ad-hoc queries and document-based queries.
  • Index Support: Any field in a document can be indexed (see the sketch after this list).
  • Replication: It supports master–slave replication; MongoDB’s native replica sets maintain multiple copies of data and self-heal through automatic failover, helping prevent database downtime.
  • Multiple Servers: The database can run over multiple servers. Data is duplicated to guard the system against hardware failure.
  • Auto-sharding: This process distributes data across multiple physical partitions called shards. Due to sharding, MongoDB has an automatic load balancing feature.
  • MapReduce: It supports MapReduce and flexible aggregation tools.
  • Failure Handling: In MongoDB, it’s easy to cope with cases of failure. A large number of replicas provides increased protection and data availability against database downtime caused by rack failures, multiple machine failures, data center failures, or even network partitions.
  • GridFS: Files of any size can be stored without complicating your stack. The GridFS feature divides files into smaller parts and stores them as separate documents.
  • Schema-less Database: It is a schema-less database written in C++.
  • Document-oriented Storage: It uses BSON, a JSON-like binary format.
  • Procedures: Instead of stored procedures, MongoDB lets you run JavaScript directly on the server.
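As a brief, hedged illustration of indexing and ad-hoc querying (the articles collection and its fields are hypothetical):

// Any field can be indexed, including one inside an embedded document
db.articles.createIndex({ author: 1 });
db.articles.createIndex({ "meta.tags": 1 });

// An ad-hoc query that can use both indexes
db.articles.find({ author: "jane", "meta.tags": "mongodb" });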

Why do you need MongoDB technology?

This technology overcame one of the biggest pitfalls of traditional database systems, that is, scalability. With the ever-evolving needs of businesses, their database systems also needed to be upgraded. MongoDB has exceptional scalability: it makes it easy to fetch data and provides continuous and automatic integration. Along with these benefits, there are multiple reasons why you need MongoDB:

  • No downtime while the application is being scaled
  • Performs in-memory processing
  • Text search
  • Graph processing
  • Global replication
  • Economical

MongoDB meets business requirements. Here is how:

  • MongoDB provides the right mix of technology and data for competitive advantage.
  • It is most suited for mission-critical applications since it considerably reduces risks.
  • It accelerates time to value (TTV) and lowers the total cost of ownership.
  • It makes it possible to build applications that are just not possible with traditional relational databases.

Benefits of MongoDB:

Distributed Data Platform: MongoDB can run across geographically distributed data centers and cloud regions, ensuring new levels of availability and scalability. With no downtime and without changing your application, MongoDB scales elastically in terms of data volume and throughput. The technology gives you enough flexibility across various data centers with good consistency.

Fast and Iterative Development: Changing business requirements will no longer affect successful project delivery in your enterprise. A flexible data model with a dynamic schema, along with powerful GUI and command-line tools, makes it fast for developers to build and evolve applications. Automated provisioning enables continuous integration and delivery for productive operations. Static relational schemas and the complex operations of RDBMSs are now a thing of the past.

Flexible Data Model: MongoDB stores data in flexible JSON-like documents, which makes data persistence and combining easy. The objects in your application code are mapped to the document model, which makes working with data easy. Needless to say, schema governance controls, data access, complex aggregations, and rich indexing functionality are not compromised in any way. The schema can be modified dynamically without downtime. Due to this flexibility, a developer needs to worry less about data manipulation.

Reduced TCO (Total Cost of Ownership): Application developers can do their job way better when MongoDB is used. The operations team can also perform their job well, thanks to the Atlas cloud service. Costs are significantly lowered as MongoDB runs on commodity hardware. The technology offers on-demand, pay-as-you-go pricing with annual subscriptions, along with 24/7 global support.

Integrated Feature Set: One can build a variety of real-time applications thanks to analytics and data visualization, event-driven streaming data pipelines, text and geospatial search, graph processing, in-memory performance, and global replication, all delivered reliably and securely. For an RDBMS to accomplish this, additional complex technologies would be required, along with separate integration requirements.

Long-term Commitment: You would be staggered to know about the development of this technology. It has garnered over 30 million downloads, 4,900 customers, and over 1,000 partners. If you include this technology in your firm, then you can be sure that your investment is in the right place.

MongoDB does not support the SQL language, for obvious reasons. Its querying style is dynamic on documents: it is a document-based query language that can be as utilitarian as SQL. MongoDB is easy to scale, and there is no need to convert or map application objects to database objects. It deploys internal memory for storing the working set, which provides faster access to data.
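To illustrate the correspondence, here is a hedged sketch comparing a typical SQL statement with its mongo shell equivalent (the users collection is hypothetical):

// SQL equivalent: SELECT name, age FROM users WHERE age > 21 ORDER BY age DESC
db.users.find(
    { age: { $gt: 21 } },      // filter, like a WHERE clause
    { name: 1, age: 1 }        // projection, like a column list
).sort({ age: -1 });           // sort, like ORDER BY ... DESC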

Frequently Used Commands in MongoDB

Database Creation

  • MongoDB doesn’t have a separate method to create a database. It automatically creates a database when you save values into a defined collection for the first time. The following command creates a database named ‘database_name’ if it doesn’t exist; if it does exist, it is simply selected (see the sketch below).
  • Command: use database_name
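A minimal shell sketch, assuming a hypothetical database and collection name:

use myshop                                  // selects (and lazily creates) the database
db.products.insertOne({ name: "pen" })      // the first write actually creates it
show dbs                                    // myshop now appears in the list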

Dropping Databases

  • The following command is used to drop a database, along with its associated files. This command acts on the current database.
  • Command: db.dropDatabase()
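For instance, a short sketch that drops the hypothetical myshop database from the previous example:

use myshop            // switch to the database you want to remove
db.dropDatabase()     // drops myshop along with its files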

Creating a Collection

  • MongoDB uses the following command to create a collection. Normally, this is not required as MongoDB automatically creates collections when some documents are inserted.
  • Command: db.createCollection(name, options)
  • Name: The string type which specifies the name of the collection to be created
  • Options: The document type which specifies the memory size and the indexing of the collection. It is an optional parameter.
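As an illustration, the following hedged sketch creates a capped collection; the name and size limits are arbitrary examples:

db.createCollection("logs", {
    capped: true,      // fixed-size collection that overwrites its oldest entries
    size: 1048576,     // maximum size in bytes (required for capped collections)
    max: 5000          // optional cap on the number of documents
});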

Showing Collections

  • When MongoDB runs the following command, it displays all the collections in the current database.
  • Command: In the shell, you can type: db.getCollectionNames()

$in Operator

  • The $in operator selects those documents where the value of a field is equal to the value in the specified array. To use the $in expression, use the following prototype:
  • Command: { field: { $in: [<value1>, <value2>, … <valueN> ] } }
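A hedged example against a hypothetical orders collection:

// Matches orders whose status is either "shipped" or "delivered"
db.orders.find({ status: { $in: [ "shipped", "delivered" ] } });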

Projection

  • Often you need only specific fields of a document rather than the whole document. The find() method displays all fields of a document by default. To limit them, set a list of fields with the value 1 or 0: 1 shows a field, while 0 hides it. Only the fields marked with 1 are returned. Among MongoDB query examples, the following query defines such a projection (see the sketch below).
  • Command: db.COLLECTION_NAME.find({},{KEY:1})
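For instance, two sketches against a hypothetical users collection:

// Returns only the name field (plus _id, which is included by default)
db.users.find({}, { name: 1 });

// Returns name and suppresses _id as well
db.users.find({}, { name: 1, _id: 0 });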

Date Operator

  • These commands are used to denote time.
  • Command:
    Date() – It returns the current date as a string.
    new Date() – It returns the current date as a Date object.

$not Operator

  • $not does a logical NOT operation on the specified <operator-expression> and selects only those documents that don’t match the <operator-expression>. This includes documents that do not contain the field.
  • Command: { field: { $not: { <operator-expression> } } }
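A hedged sketch using a hypothetical inventory collection:

// Selects documents where price is NOT greater than 1.99,
// including documents that have no price field at all
db.inventory.find({ price: { $not: { $gt: 1.99 } } });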

Delete Commands

  • The following commands explain MongoDB’s delete capabilities (see the sketch after this list).
  • Commands:
    db.collection.remove() – It deletes documents that match a filter (all matching documents by default).
    db.collection.deleteOne() – It deletes at most a single document, even if the filter matches more than one.
    db.collection.deleteMany() – It deletes all the documents that match the specified filter.
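A hedged sketch against a hypothetical sessions collection:

db.sessions.deleteOne({ user: "alice" });     // removes at most one matching document
db.sessions.deleteMany({ expired: true });    // removes every matching document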

Where Command

  • The following operator can be used to pass either a string containing a JavaScript expression or a full JavaScript function to the query system (see the sketch below).
  • Command: $where
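For example, a hedged sketch on a hypothetical players collection:

// Matches documents where wins exceeds losses; $where evaluates JavaScript per document
db.players.find({ $where: "this.wins > this.losses" });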

The forEach Command

  • A JavaScript function is applied to each document as the cursor is iterated (see the sketch below).
  • Command: cursor.forEach(function)
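A hedged sketch printing one field from each document of a hypothetical users collection:

db.users.find().forEach(function(doc) {
    print(doc.name);    // runs once per document in the cursor
});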

Where can you use MongoDB NoSQL database?

The MongoDB NoSQL database can be extensively used in Big Data and Hadoop applications for working with humongous amounts of NoSQL data, which makes up a major portion of Big Data. MongoDB and SQL databases are both database systems, but what sets them apart is their efficiency in today’s world. MongoDB can also be successfully deployed in social media and mobile applications for parsing all streaming information, which is in an unstructured format. Content management and delivery also see extensive use of the MongoDB NoSQL database. Other domains are user data management and data hubs.

Some of the biggest companies on earth are successfully deploying MongoDB, with over half of the Fortune 100 companies already customers of this incredible NoSQL database system. It has a very vibrant ecosystem, with over 100 partners and huge investor interest pouring money into the technology relentlessly.

One of the biggest insurance companies on earth, MetLife, is extensively using MongoDB for its customer service applications; the online classifieds search portal Craigslist archives its data using MongoDB. One of the most hailed brands in the media industry, The New York Times, uses MongoDB for its photo submissions and its form-building application. Finally, the extent of MongoDB’s dominance can be gauged by the fact that the world’s premier scientific endeavor, spearheaded by the CERN physics laboratory, extensively uses MongoDB for its data aggregation and data discovery applications.

How will this technology help you in your career growth?

  • MongoDB is the most widely used NoSQL database application – InfoWorld
  • A MongoDB Database Administrator in the United States can earn up to $129,000 per annum – Indeed
  • Hadoop and NoSQL markets are expected to reach $3.3 billion within the next two years – Wikibon

MongoDB is a very useful NoSQL database that is being used by some of the biggest corporations in the world. Thanks to its most powerful features, it offers enterprises an unprecedented set of capabilities for parsing all their unstructured data. Because of this, professionals who are qualified and certified in both the basics and the advanced levels of the MongoDB tool can expect to see their careers soar. Due to its versatile and scalable nature, MongoDB can be used for datasets like social media, videos, and so on, often leaving its users with little need for any other kind of database.

Cassandra Versus MongoDB

Today, in a world where your existence is judged by your online presence, the amount of data is increasing every day, and its storage and management have become major issues. Data scientists are working hard and inventing newer techniques for handling such Big Data every single day. Social media, IT industries, and every other use of the internet have been collectively enlarging the dimensions of data. The existing ways of storing data are too few for the total amount of data: the available rows and columns are not enough to take care of the continuously growing data, because most of the data generated is unstructured.

As the size of data keeps changing, scientists have found that conventional databases have to be replaced with newer, more advanced storage tools. NoSQL and Hadoop are fast-growing technologies that companies use for the storage and management of their data. Although Hadoop gets more recognition for data storage, various surveys find that, in practice, NoSQL is often the better and more advanced fit. In this article, I will discuss two NoSQL databases: MongoDB and Cassandra.

MongoDB

MongoDB is a reasonable choice for a large number of applications. Its behavior and performance are similar to conventional, older styles of storage systems, so it is quite easy and comfortable to use. The database is elastic and expandable, which makes it user-friendly and helpful to users across the network. Because of this ease of use, MongoDB is popular among engineers, who take no time getting productive with it. It has a master–slave architecture.

When we use MongoDB, we use the same data model in both the database and the code, so no complicated mapping layer is required. As a result, it is very simple to use, which makes it immensely popular among users.

Going with MongoDB is rarely a hard decision: companies that know the tool can see a return on their investment, and relying on only a few databases keeps things tension-free. It is ready for use in online transaction processing. It performs well and handles complicated situations, but it still cannot be regarded as perfect: it does not help with complicated transactions.

Advantages of MongoDB

  • Scalability
  • Flexibility
  • User-friendly
  • No concept of rows and columns
  • No re-establishment of indexing

Disadvantages of MongoDB

  • High memory usage
  • Joins can be done only through multiple queries
  • No support for transactions

Cassandra

MongoDB is popular for its ease of use, but Cassandra is popular for its ease of management, even at scale. When users need to make conventional data more dependable and faster, they move toward Cassandra. Cassandra’s architecture lets total storage grow elastically by adding commodity nodes, each owning its own share of rows and columns. It supports multiple data centers working together. With a masterless architecture, Cassandra offers great performance through its high scalability, fast writes, and efficient query handling.

Deploying newer technologies becomes very simple and comfortable once you know the interiors of the Cassandra technology. Training in Cassandra is a matter of only some hours, and proper training and certification lead you to immense understanding and an ocean of opportunities. Once you are completely aware of the Cassandra data model and how it functions, you can successfully develop Cassandra applications.

Advantages of Cassandra

  • Free of cost
  • Peer-to-peer architecture
  • Elasticity
  • Fault tolerance
  • Great performance
  • Column based
  • Tunable consistency

Disadvantages of Cassandra

  • No support for data integration
  • No streaming of blob values
  • No cursor support
  • Large result sets must be manually paged

These were a few of the differences between the two databases, Cassandra and MongoDB. If you have any other points that I may have missed, please share them in the comment box.

Hadoop Connector

The MongoDB Hadoop Adapter is a plugin for Hadoop that provides Hadoop the ability to use MongoDB as an input source and/or an output source.

Installation

The MongoDB Hadoop Adapter uses the SBT build tool for compilation. SBT provides superior support for discrete configurations targeting multiple Hadoop versions. The distribution includes a self-bootstrapping copy of SBT as sbt. Package the jar files using the following command:

./sbt package

The MongoDB Hadoop Adapter supports a number of Hadoop releases. You can change the Hadoop version supported by the build by modifying the value of hadoopRelease in the build.sbt file. For instance, setting this value to:

hadoopRelease in ThisBuild := "cdh3"

configures a build against Cloudera CDH3u3.

While:

hadoopRelease in ThisBuild := "0.21"

configures a build against Hadoop 0.21 from the mainline Apache distribution.

After building, you will need to place the “core” jar and the mongo-java-driver in the lib directory of each Hadoop server.

Getting Started with Hadoop

MongoDB and Hadoop are a powerful combination and can be used together to deliver complex analytics and data processing for data stored in MongoDB. The following guide shows how you can start working with the MongoDB-Hadoop adapter. Once you become familiar with the adapter, you can use it to pull your MongoDB data into Hadoop Map-Reduce jobs, process the data and return results back to a MongoDB collection.

MongoDB

The latest version of MongoDB should be installed and running. In addition, the MongoDB commands should be in your $PATH.

Miscellaneous

In addition to Hadoop, you should also have git and JDK 1.6 installed.

Building MongoDB Adapter

The MongoDB-Hadoop adapter source is available on GitHub. First, clone the repository and check out the release-1.0 branch:

git clone https://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
git checkout release-1.0

Now, edit build.sbt and update the build target in hadoopRelease in ThisBuild. In this example, we’re using the CDH3 Hadoop distribution from Cloudera, so we’ll set it as follows:

hadoopRelease in ThisBuild := "cdh3"

To build the adapter, use the self-bootstrapping version of sbt that ships with the MongoDB-Hadoop adapter:

./sbt package

Once the adapter is built, you will need to copy it and the latest stable version of the MongoDB Java driver to your $HADOOP_HOME/lib directory. For example, if you have Hadoop installed in /usr/lib/hadoop:

wget --no-check-certificate https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar
cp mongo-2.7.3.jar /usr/lib/hadoop/lib/
cp core/target/mongo-hadoop-core_cdh3u3-1.0.0.jar /usr/lib/hadoop/lib/

Examples

Load Sample Data

The MongoDB-Hadoop adapter ships with a few examples of how to use the adapter in your own setup. In this guide, we’ll focus on the UFO Sightings and Treasury Yield examples. To get started, first load the sample data for these examples:

./sbt load-sample-data

To confirm that the sample data was loaded, start the mongo client and look for the mongo_hadoop database and be sure that it contains the ufo_sightings.in and yield_historical.in collections:

$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> show dbs
mongo_hadoop    0.453125GB
> use mongo_hadoop
switched to db mongo_hadoop
> show collections
system.indexes
ufo_sightings.in
yield_historical.in


Treasury Yield

To build the Treasury Yield example, we’ll first need to edit one of the configuration files used by the example code:

emacs examples/treasury_yield/src/main/resources/mongo-treasury_yield.xml

and set the MongoDB locations for the input (mongo.input.uri) and output (mongo.output.uri) collections (in this example, Hadoop is running on a single node alongside MongoDB):

...
  <property>
    <!-- If you are reading from mongo, the URI -->
    <name>mongo.input.uri</name>
    <value>mongodb://127.0.0.1/mongo_hadoop.yield_historical.in</value>
  </property>
  <property>
    <!-- If you are writing to mongo, the URI -->
    <name>mongo.output.uri</name>
    <value>mongodb://127.0.0.1/mongo_hadoop.yield_historical.out</value>
  </property>
...

Next, edit the main class that we’ll use for our MapReduce job (TreasuryYieldXMLConfig.java):

emacs examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldXMLConfig.java

and update the class definition as follows:

...
public class TreasuryYieldXMLConfig extends MongoTool {
 
    static{
        // Load the XML config defined in hadoop-local.xml
        // Configuration.addDefaultResource( "hadoop-local.xml" );
        Configuration.addDefaultResource( "mongo-defaults.xml" );
        Configuration.addDefaultResource( "mongo-treasury_yield.xml" );
    }
 
    public static void main( final String[] pArgs ) throws Exception{
        System.exit( ToolRunner.run( new TreasuryYieldXMLConfig(), pArgs ) );
    }
}
...

Now let’s build the Treasury Yield example:

./sbt treasury-example/package

Once the example is done building, we can submit our MapReduce job:

hadoop jar examples/treasury_yield/target/treasury-example_cdh3u3-1.0.0.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig

This job should only take a few moments as it’s a relatively small amount of data. Now check the output collection data in MongoDB to confirm that the MapReduce job was successful:

$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> use mongo_hadoop
switched to db mongo_hadoop
> db.yield_historical.out.find()
{ "_id" : 1990, "value" : 8.552400000000002 }
{ "_id" : 1991, "value" : 7.8623600000000025 }
{ "_id" : 1992, "value" : 7.008844621513946 }
{ "_id" : 1993, "value" : 5.866279999999999 }
{ "_id" : 1994, "value" : 7.085180722891565 }
{ "_id" : 1995, "value" : 6.573920000000002 }
{ "_id" : 1996, "value" : 6.443531746031742 }
{ "_id" : 1997, "value" : 6.353959999999992 }
{ "_id" : 1998, "value" : 5.262879999999994 }
{ "_id" : 1999, "value" : 5.646135458167332 }
{ "_id" : 2000, "value" : 6.030278884462145 }
{ "_id" : 2001, "value" : 5.02068548387097 }
{ "_id" : 2002, "value" : 4.61308 }
{ "_id" : 2003, "value" : 4.013879999999999 }
{ "_id" : 2004, "value" : 4.271320000000004 }
{ "_id" : 2005, "value" : 4.288880000000001 }
{ "_id" : 2006, "value" : 4.7949999999999955 }
{ "_id" : 2007, "value" : 4.634661354581674 }
{ "_id" : 2008, "value" : 3.6642629482071714 }
{ "_id" : 2009, "value" : 3.2641200000000037 }
has more

UFO Sightings

This follows much the same process as the Treasury Yield example, with one extra step: we’ll need to add an entry to the build file to compile this example. First, open the file for editing:

emacs project/MongoHadoopBuild.scala

Next, add the following lines starting at line 72 in the build file:

...
  lazy val ufoExample = Project( id = "ufo-sightings",
                                base = file("examples/ufo_sightings"),
                                settings = exampleSettings ) dependsOn ( core )
...

Now edit the UFO Sightings config file:

emacs examples/ufo_sightings/src/main/resources/mongo-ufo_sightings.xml

and update the mongo.input.uri and mongo.output.uri properties:

...
  <property>
    <!-- If you are reading from mongo, the URI -->
    <name>mongo.input.uri</name>
    <value>mongodb://127.0.0.1/mongo_hadoop.ufo_sightings.in</value>
  </property>
  <property>
    <!-- If you are writing to mongo, the URI -->
    <name>mongo.output.uri</name>
    <value>mongodb://127.0.0.1/mongo_hadoop.ufo_sightings.out</value>
  </property>
...

Next edit the main class for the MapReduce job in UfoSightingsXMLConfig.java to use the configuration file:

emacs examples/ufo_sightings/src/main/java/com/mongodb/hadoop/examples/ufos/UfoSightingsXMLConfig.java
...
public class UfoSightingsXMLConfig extends MongoTool {
 
    static{
        // Load the XML config defined in hadoop-local.xml
        // Configuration.addDefaultResource( "hadoop-local.xml" );
        Configuration.addDefaultResource( "mongo-defaults.xml" );
        Configuration.addDefaultResource( "mongo-ufo_sightings.xml" );
    }
 
    public static void main( final String[] pArgs ) throws Exception{
        System.exit( ToolRunner.run( new UfoSightingsXMLConfig(), pArgs ) );
    }
}
...

Now build the UFO Sightings example:

./sbt ufo-sightings/package

Once the example is built, execute the MapReduce job:

hadoop jar examples/ufo_sightings/target/ufo-sightings_cdh3u3-1.0.0.jar com.mongodb.hadoop.examples.UfoSightingsXMLConfig

This MapReduce job will take just a bit longer than the Treasury Yield example. Once it’s complete, check the output collection in MongoDB to see that the job was successful:

$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> use mongo_hadoop
switched to db mongo_hadoop
> db.ufo_sightings.out.find().count()
21850

Hadoop and MongoDB Use Cases

The following are some example deployments with MongoDB and Hadoop. The goal is to provide a high-level description of how MongoDB and Hadoop can fit together in a typical Big Data stack. In each of the following examples MongoDB is used as the “operational” real-time data store and Hadoop is used for offline batch data processing and analysis.

Batch Aggregation

In several scenarios the built-in aggregation functionality provided by MongoDB is sufficient for analyzing your data. However, in certain cases, significantly more complex data aggregation may be necessary. This is where Hadoop can provide a powerful framework for complex analytics.

In this scenario data is pulled from MongoDB and processed within Hadoop via one or more Map-Reduce jobs. Data may also be brought in from additional sources within these Map-Reduce jobs to develop a multi-datasource solution. Output from these Map-Reduce jobs can then be written back to MongoDB for later querying and ad-hoc analysis. Applications built on top of MongoDB can now use the information from the batch analytics to present to the end user or to drive other downstream features.

Data Warehouse

In a typical production scenario, your application’s data may live in multiple datastores, each with their own query language and functionality. To reduce complexity in these scenarios, Hadoop can be used as a data warehouse and act as a centralized repository for data from the various sources.

In this situation, you could have periodic Map-Reduce jobs that load data from MongoDB into Hadoop. This could take the form of “daily” or “weekly” data loads pulled from MongoDB via Map-Reduce. Once the data from MongoDB is available within Hadoop, alongside data from other sources, the larger dataset can be queried. Data analysts then have the option of using either Map-Reduce or Pig to create jobs that query the larger datasets incorporating data from MongoDB.

ETL Data

MongoDB may be the operational datastore for your application but there may also be other datastores that are holding your organization’s data. In this scenario it is useful to be able to move data from one datastore to another, either from your application’s data to another database or vice versa. Moving the data is much more complex than simply piping it from one mechanism to another, which is where Hadoop can be used.

In this scenario, Map-Reduce jobs are used to extract, transform, and load data from one store to another. Hadoop can act as a complex ETL mechanism to migrate data in various forms via one or more Map-Reduce jobs that pull data from one store, apply multiple transformations (new data layouts or other aggregations), and load the data into another store. This approach can be used to move data to or from MongoDB, depending on the desired result.