Top 5 Big Data Vendors

The top five vendors offering Big Data Hadoop solutions are:

  • Cloudera
  • Amazon Web Services Elastic MapReduce Hadoop Distribution
  • Microsoft
  • MapR
  • IBM InfoSphere BigInsights

Let’s take a closer look at each of these vendors.

Cloudera & Hortonworks

Cloudera ranks at the top of the Big Data vendors for making Hadoop a reliable Big Data platform. Cloudera has more than 350 paying customers, including the US Army, Allstate, and Monsanto.

Cloudera holds about 53 percent of the Hadoop market, with Hortonworks at 16 percent and MapR at 11 percent. Cloudera’s customers value its commercial add-on tools such as Cloudera Manager, Navigator, and Impala.

CDH Overview

CDH is the most complete, tested, and popular distribution of Apache Hadoop and related projects. CDH delivers the core elements of Hadoop – scalable storage and distributed computing – along with a Web-based user interface and vital enterprise capabilities. CDH is Apache-licensed open source and is the only Hadoop solution to offer unified batch processing, interactive SQL and interactive search, and role-based access controls.

CDH provides:

  • Flexibility—Store any type of data and manipulate it with a variety of different computation frameworks including batch processing, interactive SQL, free text search, machine learning and statistical computation.
  • Integration—Get up and running quickly on a complete Hadoop platform that works with a broad range of hardware and software solutions.
  • Security—Process and control sensitive data.
  • Scalability—Enable a broad range of applications and scale and extend them to suit your requirements.
  • High availability—Perform mission-critical business tasks with confidence.
  • Compatibility—Leverage your existing IT infrastructure and investment.

Apache Hive Overview in CDH

Hive data warehouse software enables reading, writing, and managing large datasets in distributed storage. Using the Hive query language (HiveQL), which is very similar to SQL, queries are converted into a series of jobs that execute on a Hadoop cluster through MapReduce or Apache Spark.

Users can run batch processing workloads with Hive while also analyzing the same data for interactive SQL or machine-learning workloads using tools like Apache Impala or Apache Spark—all within a single platform.
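
As a small illustration of this single-platform idea, here is a minimal PySpark sketch, assuming a Hive-enabled Spark deployment and a hypothetical web_logs table; the query is the same kind of HiveQL that Hive would compile into MapReduce or Spark jobs:

Python

from pyspark.sql import SparkSession

# Hive support lets Spark read the same tables that Hive manages in the metastore.
spark = (SparkSession.builder
         .appName("HiveQLFromSpark")
         .enableHiveSupport()
         .getOrCreate())

# A HiveQL-style aggregation over the (hypothetical) web_logs table.
daily_hits = spark.sql("""
    SELECT to_date(request_ts) AS day, COUNT(*) AS hits
    FROM web_logs
    GROUP BY to_date(request_ts)
""")
daily_hits.show()
spark.stop()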

As part of CDH, Hive also benefits from:

  • Unified resource management provided by YARN
  • Simplified deployment and administration provided by Cloudera Manager
  • Shared security and governance to meet compliance requirements provided by Apache Sentry and Cloudera Navigator

Use Cases for Hive

Because Hive is a petabyte-scale data warehouse system built on the Hadoop platform, it is a good choice for environments experiencing phenomenal growth in data volume. The underlying MapReduce interface with HDFS is hard to program directly, but Hive provides an SQL interface, making it possible to use existing programming skills to perform data preparation.

Hive on MapReduce or Spark is best-suited for batch data preparation or ETL:

  • You must run scheduled batch jobs with very large ETL sorts with joins to prepare data for Hadoop. Most data served to BI users in Impala is prepared by ETL developers using Hive.
  • You run data transfer or conversion jobs that take many hours. With Hive, if a problem occurs partway through such a job, it recovers and continues.
  • You receive or provide data in diverse formats, where the Hive SerDes and variety of UDFs make it convenient to ingest and convert the data. Typically, the final stage of the ETL process with Hive might be to a high-performance, widely supported format such as Parquet.
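
As an illustration of that final conversion to Parquet, here is a hedged sketch using the PyHive client; the host, table, and column names are hypothetical, and the raw_events table is assumed to already exist:

Python

from pyhive import hive

# Connect to HiveServer2 (placeholder host) and materialize a Parquet copy of a raw table.
cur = hive.connect(host="hiveserver2.example.com", port=10000).cursor()
cur.execute("""
    CREATE TABLE events_parquet
    STORED AS PARQUET
    AS SELECT event_id, user_id, event_type, event_ts
       FROM raw_events
       WHERE event_ts IS NOT NULL
""")

Downstream BI queries, for example in Impala, would then read events_parquet rather than the raw ingest table.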

Hive Components

Hive consists of the following components:

  • The Metastore Database
  • HiveServer2

The Metastore Database

The metastore database is an important aspect of the Hive infrastructure. It is a separate database, relying on a traditional RDBMS such as MySQL or PostgreSQL, that holds metadata about Hive databases, tables, columns, partitions, and Hadoop-specific information such as the underlying data files and HDFS block locations.

The metastore database is shared by other components. For example, the same tables can be inserted into, queried, altered, and so on by both Hive and Impala. Although you might see references to the “Hive metastore”, be aware that the metastore database is used broadly across the Hadoop ecosystem, even in cases where you are not using Hive itself.

The metastore database is relatively compact, with fast-changing data. Backup, replication, and other kinds of management operations affect this database. See Configuring the Hive Metastore for CDH for details about configuring the Hive metastore.

Cloudera recommends that you deploy the Hive metastore, which stores the metadata for Hive tables and partitions, in “remote mode.” In this mode the metastore service runs in its own JVM process and other services, such as HiveServer2, HCatalog, and Apache Impala communicate with the metastore using the Thrift network API.

See Starting the Hive Metastore in CDH for details about starting the Hive metastore service.

HiveServer2

HiveServer2 is a server interface that enables remote clients to submit queries to Hive and retrieve the results. HiveServer2 supports multi-client concurrency, capacity planning controls, Sentry authorization, Kerberos authentication, LDAP, SSL, and provides support for JDBC and ODBC clients.

HiveServer2 is a container for the Hive execution engine. For each client connection, it creates a new execution context that serves Hive SQL requests from the client. It supports JDBC clients, such as the Beeline CLI, and ODBC clients. Clients connect to HiveServer2 through the Thrift API-based Hive service.
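
For example, a Python client can reach HiveServer2 over the same Thrift-based service; the hedged sketch below assumes the PyHive package, a HiveServer2 instance on the default port 10000, and a hypothetical sales table:

Python

from pyhive import hive

# Each connection gets its own execution context on the server, as described above.
conn = hive.connect(host="hiveserver2.example.com", port=10000, username="analyst")
cur = conn.cursor()

cur.execute("SELECT country, COUNT(*) AS orders FROM sales GROUP BY country")
for country, orders in cur.fetchall():
    print(country, orders)

cur.close()
conn.close()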

See Configuring HiveServer2 for CDH for details on configuring HiveServer2 and see Starting, Stopping, and Using HiveServer2 in CDH for details on starting/stopping the HiveServer2 service and information about using the Beeline CLI to connect to HiveServer2. For details about managing HiveServer2 with its native web user interface (UI), see Using HiveServer2 Web UI in CDH.

How Hive Works with Other Components

Hive integrates with other components, which serve as query execution engines or as data stores:

  • Hive on Spark
  • Hive and HBase
  • Hive on Amazon S3
  • Hive on Microsoft Azure Data Lake Store

Hive on Spark

Hive traditionally uses MapReduce behind the scenes to parallelize the work, and perform the low-level steps of processing a SQL statement such as sorting and filtering. Hive can also use Spark as the underlying computation and parallelization engine. See Running Apache Hive on Spark in CDH for details about configuring Hive to use Spark as its execution engine and see Tuning Apache Hive on Spark in CDH for details about tuning Hive on Spark.
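
As a small, hedged illustration, the engine can be chosen per session through the standard hive.execution.engine property (shown here with the PyHive client; whether the spark value is accepted depends on how the cluster is configured):

Python

from pyhive import hive

# Connect to HiveServer2 (placeholder host) and override the engine for this session only.
cur = hive.connect(host="hiveserver2.example.com", port=10000).cursor()
cur.execute("SET hive.execution.engine=spark")   # per-session override; the default may be mr
cur.execute("SELECT COUNT(*) FROM web_logs")     # this query now runs as Spark jobs
print(cur.fetchone())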

Hive and HBase

Apache HBase is a NoSQL database that supports real-time read/write access to large datasets in HDFS. See Using Apache Hive with HBase in CDH for details about configuring Hive to use HBase. For information about running Hive queries on a secure HBase server, see Using Hive to Run Queries on a Secure HBase Server.

Hive on Amazon S3

Use the Amazon S3 filesystem to efficiently manage transient Hive ETL (extract-transform-load) jobs. For step-by-step instructions to configure Hive to use S3 and multiple scripting examples, see Configuring Transient Hive ETL Jobs to Use the Amazon S3 Filesystem. To optimize how Hive writes data to and reads data from S3-backed tables and partitions, see Tuning Hive Performance on the Amazon S3 Filesystem. For information about setting up a shared Amazon Relational Database Service (RDS) as your Hive metastore, see Configuring a Shared Amazon RDS as an HMS for CDH.

Hive on Microsoft Azure Data Lake Store

In CDH 5.12 and higher, both Hive on MapReduce2 and Hive on Spark can access tables on Microsoft Azure Data Lake store (ADLS). In contrast to Amazon S3, ADLS more closely resembles native HDFS behavior, providing consistency, file directory structure, and POSIX-compliant ACLs. See Configuring ADLS Gen1 Connectivity for information about configuring and using ADLS with Hive on MapReduce2.

How Impala Works with CDH

The following graphic illustrates how Impala is positioned in the broader Cloudera environment:

The Impala solution is composed of the following components:

  • Clients – Entities including Hue, ODBC clients, JDBC clients, and the Impala Shell can all interact with Impala. These interfaces are typically used to issue queries or complete administrative tasks such as connecting to Impala.
  • Hive Metastore – Stores information about the data available to Impala. For example, the metastore lets Impala know what databases are available and what the structure of those databases is. As you create, drop, and alter schema objects, load data into tables, and so on through Impala SQL statements, the relevant metadata changes are automatically broadcast to all Impala nodes by the dedicated catalog service introduced in Impala 1.2.
  • Impala – This process, which runs on DataNodes, coordinates and executes queries. Each instance of Impala can receive, plan, and coordinate queries from Impala clients. Queries are distributed among Impala nodes, and these nodes then act as workers, executing parallel query fragments.
  • HBase and HDFS – Storage for data to be queried.

Queries executed using Impala are handled as follows:

  1. User applications send SQL queries to Impala through ODBC or JDBC, which provide standardized querying interfaces. The user application may connect to any impalad in the cluster. This impalad becomes the coordinator for the query.
  2. Impala parses the query and analyzes it to determine what tasks need to be performed by impalad instances across the cluster. Execution is planned for optimal efficiency.
  3. Services such as HDFS and HBase are accessed by local impalad instances to provide data.
  4. Each impalad returns data to the coordinating impalad, which sends these results to the client.
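
From the client’s point of view this flow is transparent: the application connects to one impalad and simply reads results. The hedged sketch below uses the impyla package; the host, port, and table names are hypothetical:

Python

from impala.dbapi import connect

# Connecting to any impalad works; that daemon becomes the coordinator for the query.
conn = connect(host="impalad-host.example.com", port=21050)
cur = conn.cursor()

# The coordinator plans the query, distributes fragments to the other impalad workers,
# collects their partial results, and returns the final result set to the client.
cur.execute("SELECT store_id, SUM(amount) FROM sales GROUP BY store_id")
for row in cur.fetchall():
    print(row)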

Primary Impala Features

Impala provides support for:

  • Most common SQL-92 features of Hive Query Language (HiveQL) including SELECT, joins, and aggregate functions.
  • HDFS, HBase, and Amazon Simple Storage System (S3) storage, including:
    • HDFS file formats: delimited text files, Parquet, Avro, SequenceFile, and RCFile.
    • Compression codecs: Snappy, GZIP, Deflate, BZIP.
  • Common data access interfaces including:
    • JDBC driver.
    • ODBC driver.
    • Hue Beeswax and the Impala Query UI.
  • impala-shell command-line interface.
  • Kerberos authentication.

Apache Kudu Overview

Apache Kudu is a columnar storage manager developed for the Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: It runs on commodity hardware, is horizontally scalable, and supports highly available operation.

Apache Kudu is a top-level project in the Apache Software Foundation.

Kudu’s benefits include:

  • Fast processing of OLAP workloads.
  • Integration with MapReduce, Spark, Flume, and other Hadoop ecosystem components.
  • Tight integration with Apache Impala, making it a good, mutable alternative to using HDFS with Apache Parquet.
  • Strong but flexible consistency model, allowing you to choose consistency requirements on a per-request basis, including the option for strict serialized consistency.
  • Strong performance for running sequential and random workloads simultaneously.
  • Easy administration and management through Cloudera Manager.
  • High availability. Tablet Servers and Master use the Raft consensus algorithm, which ensures availability as long as more replicas are available than unavailable. Reads can be serviced by read-only follower tablets, even in the event of a leader tablet failure.
  • Structured data model.

By combining all of these properties, Kudu targets applications that are difficult or impossible to implement on currently available Hadoop storage technologies. Applications for which Kudu is a viable solution include:

  • Reporting applications where new data must be immediately available for end users
  • Time-series applications that must support queries across large amounts of historic data while simultaneously returning granular queries about an individual entity
  • Applications that use predictive models to make real-time decisions, with periodic refreshes of the predictive model based on all historical data

Kudu-Impala Integration

Apache Kudu has tight integration with Apache Impala, allowing you to use Impala to insert, query, update, and delete data from Kudu tablets using Impala’s SQL syntax, as an alternative to using the Kudu APIs to build a custom Kudu application. In addition, you can use JDBC or ODBC to connect existing or new applications written in any language, framework, or business intelligence tool to your Kudu data, using Impala as the broker.
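
For illustration, the hedged sketch below drives this integration through the impyla client; host, table, and column names are hypothetical, and the DDL reflects Impala’s Kudu syntax. The individual capabilities are described in more detail in the list that follows.

Python

from impala.dbapi import connect

cur = connect(host="impalad-host.example.com", port=21050).cursor()

# Create a Kudu-backed table; Kudu requires a primary key and a partition scheme.
cur.execute("""
    CREATE TABLE IF NOT EXISTS metrics (
        host STRING,
        ts BIGINT,
        value DOUBLE,
        PRIMARY KEY (host, ts)
    )
    PARTITION BY HASH (host) PARTITIONS 4
    STORED AS KUDU
""")

# Row-level mutations that are not possible on plain HDFS-backed tables.
cur.execute("UPSERT INTO metrics VALUES ('web01', 1700000000, 0.42)")
cur.execute("UPDATE metrics SET value = 0.5 WHERE host = 'web01' AND ts = 1700000000")
cur.execute("DELETE FROM metrics WHERE host = 'web01'")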

  • CREATE/ALTER/DROP TABLE – Impala supports creating, altering, and dropping tables using Kudu as the persistence layer. The tables follow the same internal/external approach as other tables in Impala, allowing for flexible data ingestion and querying.
  • INSERT – Data can be inserted into Kudu tables from Impala using the same mechanisms as any other table with HDFS or HBase persistence.
  • UPDATE/DELETE – Impala supports the UPDATE and DELETE SQL commands to modify existing data in a Kudu table row-by-row or as a batch. The syntax of the SQL commands is designed to be as compatible as possible with existing solutions. In addition to simple DELETE or UPDATE commands, you can specify complex joins in the FROM clause of the query, using the same syntax as a regular SELECT statement.
  • Flexible Partitioning – Similar to partitioning of tables in Hive, Kudu allows you to dynamically pre-split tables by hash or range into a predefined number of tablets, in order to distribute writes and queries evenly across your cluster. You can partition by any number of primary key columns, with any number of hashes, a list of split rows, or a combination of these. A partition scheme is required.
  • Parallel Scan – To achieve the highest possible performance on modern hardware, the Kudu client used by Impala parallelizes scans across multiple tablets.
  • High-efficiency queries – Where possible, Impala pushes down predicate evaluation to Kudu, so that predicates are evaluated as close as possible to the data. Query performance is comparable to Parquet in many workloads.

Example Use Cases

Streaming Input with Near Real Time Availability

A common business challenge is one where new data arrives rapidly and constantly, and the same data needs to be available in near real time for reads, scans, and updates. Kudu offers the powerful combination of fast inserts and updates with efficient columnar scans to enable real-time analytics use cases on a single storage layer.

Time-Series Application with Widely Varying Access Patterns

A time-series schema is one in which data points are organized and keyed according to the time at which they occurred. This can be useful for investigating the performance of metrics over time or attempting to predict future behavior based on past data. For instance, time-series customer data might be used both to store purchase click-stream history and to predict future purchases, or for use by a customer support representative. While these different types of analysis are occurring, inserts and mutations might also be occurring individually and in bulk, and become available immediately to read workloads. Kudu can handle all of these access patterns simultaneously in a scalable and efficient manner.

Kudu is a good fit for time-series workloads for several reasons. With Kudu’s support for hash-based partitioning, combined with its native support for compound row keys, it is simple to set up a table spread across many servers without the risk of “hotspotting” that is commonly observed when range partitioning is used. Kudu’s columnar storage engine is also beneficial in this context, because many time-series workloads read only a few columns, as opposed to the whole row.

In the past, you might have needed to use multiple datastores to handle different data access patterns. This practice adds complexity to your application and operations, and duplicates your data, doubling (or worse) the amount of storage required. Kudu can handle all of these access patterns natively and efficiently, without the need to off-load work to other datastores.

Predictive Modeling

Data scientists often develop predictive learning models from large sets of data. The model and the data might need to be updated or modified often as the learning takes place or as the situation being modeled changes. In addition, the scientist might want to change one or more factors in the model to see what happens over time. Updating a large set of data stored in files in HDFS is resource-intensive, as each file needs to be completely rewritten. In Kudu, updates happen in near real time. The scientist can tweak the value, re-run the query, and refresh the graph in seconds or minutes, rather than hours or days. In addition, batch or incremental algorithms can be run across the data at any time, with near-real-time results.

Combining Data In Kudu With Legacy Systems

Companies generate data from multiple sources and store it in a variety of systems and formats. For instance, some of your data might be stored in Kudu, some in a traditional RDBMS, and some in files in HDFS. You can access and query all of these sources and formats using Impala, without the need to change your legacy systems.

Related Information

  • Apache Kudu Concepts and Architecture
  • Apache Kudu Installation and Upgrade
  • Kudu Security Overview
  • More Resources for Apache Kudu

Apache Sentry Overview

Apache Sentry is a granular, role-based authorization module for Hadoop. Sentry provides the ability to control and enforce precise levels of privileges on data for authenticated users and applications on a Hadoop cluster. Sentry currently works out of the box with Apache Hive, Hive Metastore/HCatalog, Apache Solr, Impala, and HDFS (limited to Hive table data).

Sentry is designed to be a pluggable authorization engine for Hadoop components. It allows you to define authorization rules to validate a user or application’s access requests for Hadoop resources. Sentry is highly modular and can support authorization for a wide variety of data models in Hadoop.

Apache Spark Overview

Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala and consists of Spark core and several related projects.

You can run Spark applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Running Spark applications interactively is commonly performed during the data-exploration phase and for ad hoc analysis.

To run applications distributed across a cluster, Spark requires a cluster manager. In CDH 6, Cloudera supports only the YARN cluster manager. When run on YARN, Spark application processes are managed by the YARN ResourceManager and NodeManager roles. Spark Standalone is no longer supported.

The Apache Spark 2 service in CDH 6 consists of Spark core and several related projects:

Spark SQL

Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs.
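
For example, a minimal PySpark sketch (the input path and column names are hypothetical) that registers a DataFrame as a temporary view and mixes SQL with DataFrame code:

Python

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Read a DataFrame, expose it to SQL as a temporary view, and query it.
events = spark.read.json("/data/events.json")       # hypothetical input path
events.createOrReplaceTempView("events")

top_types = spark.sql(
    "SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type ORDER BY n DESC")
top_types.show()

spark.stop()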

Spark Streaming

API that allows you to build scalable fault-tolerant streaming applications.

MLlib

API that implements common machine learning algorithms.

The Cloudera Enterprise product includes the Spark features roughly corresponding to the feature set and bug fixes of Apache Spark 2.4. The Spark 2.x service was previously shipped as its own parcel, separate from CDH.

In CDH 6, the Spark 1.6 service does not exist. The Spark History Server uses port 18088, the same port formerly used by Spark 1.6 and a change from port 18089, which the separate Spark 2 parcel formerly used.

Unsupported Features

The following Spark features are not supported:

  • Apache Spark experimental features and APIs are not supported unless stated otherwise.
  • Using the JDBC Datasource API to access Hive or Impala is not supported.
  • ADLS is not supported for all Spark components. Microsoft Azure Data Lake Store (ADLS) is a cloud-based filesystem that you can access through Spark applications. Spark with Kudu is not currently supported for ADLS data. (Hive on Spark is available for ADLS in CDH 5.12 and higher.)
  • IPython/Jupyter notebooks are not supported. (The IPython notebook system was renamed to Jupyter as of IPython 4.0.)
  • Certain Spark Streaming features are not supported. The mapWithState method is unsupported because it is a nascent, unstable API.
  • The Thrift JDBC/ODBC server is not supported.
  • The Spark SQL CLI is not supported.
  • GraphX is not supported.
  • SparkR is not supported.
  • Structured Streaming is supported, but the following features of it are not:
    • Continuous processing, which is still experimental.
    • Stream-static joins with HBase, which have not been tested and are therefore not supported.
  • The Spark cost-based optimizer (CBO) is not supported.


Amazon Web Services Elastic MapReduce Hadoop Distribution

Amazon Elastic MapReduce is part of Amazon Web Services (AWS) and has been around since the early days of Hadoop. AWS offers an easy-to-use, well-organized data analytics platform built on the HDFS architecture, and it is among the highest-ranking vendors, with one of the largest market shares worldwide.

DynamoDB is another major NoSQL database from this AWS Hadoop vendor, built to run very large consumer websites.

Amazon Elastic MapReduce (Amazon EMR)

Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) tool for big data processing and analysis. Amazon EMR offers an expandable, low-configuration service as an easier alternative to running in-house cluster computing.

Amazon EMR is based on Apache Hadoop, a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. It was developed at Google for indexing web pages and replaced their original indexing algorithms and heuristics in 2004.

Amazon EMR processes big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). The elastic in EMR’s name refers to its dynamic resizing ability, which allows it to ramp up or reduce resource use depending on the demand at any given time.

Processing big data with Amazon EMR

Amazon EMR is used for data analysis in log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, bioinformatics and more. EMR also supports workloads based on Apache Spark, Presto and Apache HBase — the latter of which integrates with Hive and Pig for additional functionality.

Introduction to AWS EMR

Amazon EMR is a big data platform that currently leads among cloud-native big data platforms. It processes vast amounts of data quickly and at a cost-effective scale using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto, combined with the auto-scaling capability of Amazon EC2 and the storage scalability of Amazon S3. EMR gives you the flexibility to run short-lived clusters that automatically scale to meet the demands of a task, or long-running, highly available clusters.

AWS provides many services that make this kind of work easier; some of these technologies are:

  1. Amazon EC2
  2. Amazon RDS
  3. Amazon S3
  4. Amazon CloudFront
  5. Amazon Auto Scaling
  6. Amazon Lambda
  7. Amazon Redshift
  8. Amazon Elastic MapReduce (EMR)

The major AWS service we are going to focus on here is Amazon EMR.

EMR, commonly called Elastic MapReduce, offers an easy, approachable way to process large chunks of data. Imagine a big data scenario in which we have a huge amount of data and are running a set of operations over it, say a MapReduce job. One of the major issues a big data application faces is tuning: we often find it difficult to fine-tune a program so that all of the allocated resources are consumed properly, and because of this the processing time gradually increases.

Elastic MapReduce, the service by Amazon, is a web service that provides a framework managing all the features needed for big data processing in a cost-effective, fast, and secure manner. From cluster creation to distributing data over the various instances, everything is managed by Amazon EMR. The service is on-demand, meaning we can control the number of instances based on the data we have, which makes it cost-efficient and scalable.
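
For illustration, the hedged sketch below launches such an on-demand, self-terminating cluster with the boto3 EMR client; the release label, instance types, S3 paths, and IAM role names are placeholders that depend on your account:

Python

import boto3

emr = boto3.client("emr", region_name="us-east-1")

cluster = emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-6.10.0",                 # EMR release label (assumption)
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://my-bucket/emr-logs/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                    # 1 master node + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": False,  # short-lived cluster: terminate when done
    },
    Steps=[{
        "Name": "spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/job.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(cluster["JobFlowId"])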

Reasons for Using AWS EMR

So why use EMR, and what makes it better than the alternatives? We often encounter a very basic problem: we are unable to allocate all of the resources available in the cluster to an application. Amazon EMR takes care of this problem; based on the size of the data and the demands of the application, it allocates the necessary resources, and being elastic in nature, the allocation can be changed as needed. EMR has broad application support, including Hadoop, Spark, and HBase, which makes data processing easier. It supports various ETL operations quickly and cost-effectively, and it can also be used for MLlib in Spark, so we can run various machine learning algorithms on it. Whether it is batch data or real-time streaming data, EMR is capable of organizing and processing both.

Working of AWS EMR

Clusters are the central component of the Amazon EMR architecture. A cluster is a collection of EC2 instances called nodes. Each node has a specific role within the cluster, termed the node type, and based on these roles nodes fall into three types:

  • Master node
  • Core node
  • Task node

  • The master node, as the name suggests, is responsible for managing the cluster: it runs the cluster components and distributes data over the nodes for processing. It keeps track of whether everything is properly managed and running, and intervenes in the case of failure.
  • The core nodes are responsible for running tasks and storing data in HDFS within the cluster. The processing work is handled by the core nodes, and after processing, the data is written to the desired HDFS location.
  • The task nodes are optional; they only run tasks and do not store data in HDFS.
  • After submitting a job, we have several ways to choose how the work gets done, from terminating the cluster after job completion to keeping a long-running cluster, using the EMR console or CLI to submit steps.
  • We can run jobs directly on EMR by connecting to the master node through the available interfaces and tools.
  • We can also process data in steps: submit one or more ordered steps to the EMR cluster. The data is stored as files and processed sequentially, and each step can be traced from the Pending state to the Completed state (or Failed or Canceled), which makes errors easy to track down.
  • Once all the instances are terminated, the cluster reaches its completed state.
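
The step-based flow just described can also be driven programmatically. The hedged sketch below uses the boto3 EMR client to add a step to an existing cluster and trace it from Pending through to a terminal state; the cluster ID and S3 path are placeholders:

Python

import time
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",               # ID of an existing cluster
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",       # built-in runner on EMR
            "Args": ["spark-submit", "s3://my-bucket/scripts/wordcount.py"],
        },
    }],
)
step_id = response["StepIds"][0]

# Trace the step through its states (PENDING, RUNNING, then a terminal state).
while True:
    status = emr.describe_step(ClusterId="j-XXXXXXXXXXXXX", StepId=step_id)
    state = status["Step"]["Status"]["State"]
    print(state)
    if state in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(30)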

Architecture for AWS EMR

The EMR architecture has several layers, from storage up to applications.

  • The first layer is the storage layer, which includes the different file systems used with the cluster: HDFS, EMRFS, and the local file system are all used for data storage across the application. Caching of intermediate results during MapReduce processing can also be handled by these technologies.
  • The second layer is resource management for the cluster. This layer is responsible for managing resources across the clusters and nodes of the application and acts as the management tool that distributes data evenly over the cluster. The default resource management tool that EMR uses is YARN, introduced in Apache Hadoop 2.0. It centrally manages resources for multiple data processing frameworks and keeps track of everything needed for a healthy cluster, from node health to resource distribution and memory management.
  • The third layer is the data processing framework, responsible for the analysis and processing of data. EMR supports many frameworks that enable parallel, efficient data processing; among the best known are Apache Hadoop, Spark, and Spark Streaming.
  • The fourth layer consists of applications and programs such as Hive, Pig, streaming libraries, and machine learning algorithms that help process and manage large data sets.

Advantages of AWS EMR

Let us now check some of the benefits of using EMR:

  1. High speed: Because resources are utilized properly, query processing time is comparatively faster than with other data processing tools.
  2. Bulk data processing: However large the data, EMR is capable of processing huge amounts of it in reasonable time.
  3. Minimal data loss: Because data is distributed over the cluster and processed in parallel across the network, there is minimal chance of data loss, and the accuracy of the processed data is better.
  4. Cost-effective: It is cheaper than most alternatives, which makes it attractive for industry use; because the pricing is low, we can accommodate large amounts of data and process it within budget.
  5. AWS integration: It is integrated with the other AWS services, so security, storage, networking, and more are all available under one roof.
  6. Security: It comes with security groups to control inbound and outbound traffic, and the use of IAM roles adds fine-grained permissions that keep data secure.
  7. Monitoring and deployment: There are proper monitoring tools for all the applications running on EMR clusters, which makes analysis transparent and easy, and it has an auto-deployment feature in which applications are configured and deployed automatically.

There are many more advantages that make EMR a better choice than other cluster computing approaches.

Microsoft Hadoop Distribution

Microsoft is an IT business not known for open source software, yet it has worked to make the Hadoop platform run on Windows. Its Hadoop offering is delivered as a cloud product, Microsoft Azure’s HDInsight, built primarily to work with Azure.

An additional Microsoft specialty is its PolyBase feature, which lets customers query Hadoop data directly from SQL Server.

Apache Hadoop architecture in HDInsight

Apache Hadoop includes two core components: the Apache Hadoop Distributed File System (HDFS) that provides storage, and Apache Hadoop Yet Another Resource Negotiator (YARN) that provides processing. With storage and processing capabilities, a cluster becomes capable of running MapReduce programs to perform the desired data processing.

An HDFS is not typically deployed within the HDInsight cluster to provide storage. Instead, an HDFS-compatible interface layer is used by Hadoop components. The actual storage capability is provided by either Azure Storage or Azure Data Lake Storage. For Hadoop, MapReduce jobs executing on the HDInsight cluster run as if an HDFS were present and so require no changes to support their storage needs. In Hadoop on HDInsight, storage is outsourced, but YARN processing remains a core component. For more information, see Introduction to Azure HDInsight.

This article introduces YARN and how it coordinates the execution of applications on HDInsight.

Apache Hadoop YARN basics

YARN governs and orchestrates data processing in Hadoop. YARN has two core services that run as processes on nodes in the cluster:

  • ResourceManager
  • NodeManager

The ResourceManager grants cluster compute resources to applications such as MapReduce jobs. It grants these resources as containers, where each container consists of an allocation of CPU cores and RAM. If you combine all the resources available in a cluster and then distribute the cores and memory in blocks, each block of resources is a container. Each node in the cluster has capacity for a certain number of containers, so the cluster has a fixed limit on the number of containers available. The allotment of resources in a container is configurable.
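
As a back-of-the-envelope illustration of that fixed limit, the container arithmetic looks like the following; the numbers are invented, and the comments name the standard YARN settings they stand in for:

Python

# Illustrative container arithmetic only; the values are made up.
node_memory_mb = 49152      # stands in for yarn.nodemanager.resource.memory-mb
node_vcores = 12            # stands in for yarn.nodemanager.resource.cpu-vcores
container_memory_mb = 4096  # memory requested per container
container_vcores = 1        # cores requested per container

containers_per_node = min(node_memory_mb // container_memory_mb,
                          node_vcores // container_vcores)
worker_nodes = 8
print(containers_per_node)                  # 12 containers fit on each node
print(containers_per_node * worker_nodes)   # fixed cluster-wide ceiling: 96 containers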

When a MapReduce application runs on a cluster, the ResourceManager provides the application the containers in which to execute. The ResourceManager tracks the status of running applications, available cluster capacity, and tracks applications as they complete and release their resources.

The ResourceManager also runs a web server process that provides a web user interface to monitor the status of applications.

When a user submits a MapReduce application to run on the cluster, the application is submitted to the ResourceManager. In turn, the ResourceManager allocates a container on available NodeManager nodes. The NodeManager nodes are where the application actually executes. The first container allocated runs a special application called the ApplicationMaster. This ApplicationMaster is responsible for acquiring resources, in the form of subsequent containers, needed to run the submitted application. The ApplicationMaster examines the stages of the application, such as the map stage and reduce stage, and factors in how much data needs to be processed. The ApplicationMaster then requests (negotiates) the resources from the ResourceManager on behalf of the application. The ResourceManager in turn grants resources from the NodeManagers in the cluster to the ApplicationMaster for it to use in executing the application.

The NodeManagers run the tasks that make up the application, then report their progress and status back to the ApplicationMaster. The ApplicationMaster in turn reports the status of the application back to the ResourceManager. The ResourceManager returns any results to the client.

YARN on HDInsight

All HDInsight cluster types deploy YARN. The ResourceManager is deployed for high availability with a primary and secondary instance, which runs on the first and second head nodes within the cluster respectively. Only the one instance of the ResourceManager is active at a time. The NodeManager instances run across the available worker nodes in the cluster.

Hadoop in Azure HDInsight

Azure HDInsight is a fully managed, full-spectrum, open-source analytics service in the cloud for enterprises. The Apache Hadoop cluster type in Azure HDInsight allows you to use the Apache Hadoop Distributed File System (HDFS), Apache Hadoop YARN resource management, and a simple MapReduce programming model to process and analyze batch data in parallel. Hadoop clusters in HDInsight are compatible with Azure Blob storage, Azure Data Lake Storage Gen1, or Azure Data Lake Storage Gen2.

To see available Hadoop technology stack components on HDInsight, see Components and versions available with HDInsight. To read more about Hadoop in HDInsight, see the Azure features page for HDInsight.

A basic word count MapReduce job example is illustrated in the following diagram:

The output of this job is a count of how many times each word occurred in the text.

  • The mapper takes each line from the input text as input and breaks it into words. It emits a key/value pair each time a word occurs: the word followed by a 1. The output is sorted before it is sent to the reducer.
  • The reducer sums these individual counts for each word and emits a single key/value pair that contains the word followed by the sum of its occurrences.

MapReduce can be implemented in various languages. Java is the most common implementation, and is used for demonstration purposes in this document.
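
As an illustration of the mapper and reducer logic just described, here is a minimal local Python simulation; real HDInsight samples are typically written in Java, as noted above, and each phase would run distributed across the cluster:

Python

from itertools import groupby

def mapper(lines):
    # Emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Pairs arrive sorted by word; sum the counts for each word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog"]
    shuffled = sorted(mapper(text))          # stands in for Hadoop's sort/shuffle phase
    for word, count in reducer(shuffled):
        print(word, count)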

Use Spark & Hive Tools for Visual Studio Code

Learn how to use Apache Spark & Hive Tools for Visual Studio Code. Use the tools to create and submit Apache Hive batch jobs, interactive Hive queries, and PySpark scripts for Apache Spark. First we’ll describe how to install Spark & Hive Tools in Visual Studio Code. Then we’ll walk through how to submit jobs to Spark & Hive Tools.

Spark & Hive Tools can be installed on platforms that are supported by Visual Studio Code. Note the following prerequisites for different platforms.

Prerequisites

The following items are required for completing the steps in this article:

  • An Azure HDInsight cluster. To create a cluster, see Get started with HDInsight. Or use a Spark and Hive cluster that supports an Apache Livy endpoint.
  • Visual Studio Code.
  • Mono. Mono is required only for Linux and macOS.
  • A PySpark interactive environment for Visual Studio Code.
  • A local directory. This article uses C:\HD\HDexample.

Install Spark & Hive Tools

After you meet the prerequisites, you can install Spark & Hive Tools for Visual Studio Code by following these steps:

  1. Open Visual Studio Code.
  2. From the menu bar, navigate to View > Extensions.
  3. In the search box, enter Spark & Hive.
  4. Select Spark & Hive Tools from the search results, and then select Install.
  5. Select Reload when necessary.

Open a work folder

To open a work folder and to create a file in Visual Studio Code, follow these steps:

  1. From the menu bar, navigate to File > Open Folder… > C:\HD\HDexample, and then select the Select Folder button. The folder appears in the Explorer view on the left.
  2. In Explorer view, select the HDexample folder, and then select the New File icon next to the work folder.
  3. Name the new file by using either the .hql (Hive queries) or the .py (Spark script) file extension. This example uses HelloWorld.hql.

Set the Azure environment

For a national cloud user, follow these steps to set the Azure environment first, and then use the Azure: Sign In command to sign in to Azure:

  1. Navigate to File > Preferences > Settings.
  2. Search on the following string: Azure: Cloud.
  3. Select the national cloud from the list:

Connect to an Azure account

Before you can submit scripts to your clusters from Visual Studio Code, you must either sign in to your Azure subscription or link an HDInsight cluster. Use the Ambari username/password or domain-joined credentials (for an ESP cluster) to connect to your HDInsight cluster. Follow these steps to connect to Azure:

  1. From the menu bar, navigate to View > Command Palette…, and enter Azure: Sign In.
  2. Follow the sign-in instructions to sign in to Azure. After you’re connected, your Azure account name shows on the status bar at the bottom of the Visual Studio Code window.

Link a cluster

Link: Azure HDInsight

You can link a normal cluster by using an Apache Ambari-managed username, or you can link an Enterprise Security Pack secure Hadoop cluster by using a domain username (such as user@domain.com).

  1. From the menu bar, navigate to View > Command Palette…, and enter Spark / Hive: Link a Cluster.
  2. Select the linked cluster type Azure HDInsight.
  3. Enter the HDInsight cluster URL.
  4. Enter your Ambari username; the default is admin.
  5. Enter your Ambari password.
  6. Select the cluster type.
  7. Set the display name of the cluster (optional).
  8. Review the OUTPUT view for verification.

The linked username and password are used if you have both signed in to an Azure subscription and linked a cluster.

Link: Generic Livy endpoint

  1. From the menu bar, navigate to View > Command Palette…, and enter Spark / Hive: Link a Cluster.
  2. Select linked cluster type Generic Livy Endpoint.
  3. Enter the generic Livy endpoint. For example: http://10.172.41.42:18080.
  4. Select authorization type Basic or None. If you select Basic:
    1. Enter your Ambari username; the default is admin.
    2. Enter your Ambari password.
  5. Review OUTPUT view for verification.

List clusters

  1. From the menu bar, navigate to View > Command Palette…, and enter Spark / Hive: List Cluster.
  2. Select the subscription that you want.
  3. Review the OUTPUT view. This view shows your linked cluster (or clusters) and all the clusters under your Azure subscription:

Set the default cluster

  1. Reopen the HDexample folder that was discussed earlier, if closed.
  2. Select the HelloWorld.hql file that was created earlier. It opens in the script editor.
  3. Right-click the script editor, and then select Spark / Hive: Set Default Cluster.
  4. Connect to your Azure account, or link a cluster if you haven’t yet done so.
  5. Select a cluster as the default cluster for the current script file. The tools automatically update the .VSCode\settings.json configuration file:

Submit interactive Hive queries and Hive batch scripts

With Spark & Hive Tools for Visual Studio Code, you can submit interactive Hive queries and Hive batch scripts to your clusters.

  1. Reopen the HDexample folder that was discussed earlier, if closed.
  2. Select the HelloWorld.hql file that was created earlier. It opens in the script editor.
  3. Copy and paste the following code into your Hive file, and then save it:

HiveQL

SELECT * FROM hivesampletable;
  4. Connect to your Azure account, or link a cluster if you haven’t yet done so.
  5. Right-click the script editor and select Hive: Interactive to submit the query, or use the Ctrl+Alt+I keyboard shortcut. Select Hive: Batch to submit the script, or use the Ctrl+Alt+H keyboard shortcut.
  6. If you haven’t specified a default cluster, select a cluster. The tools also let you submit a block of code instead of the whole script file by using the context menu. After a few moments, the query results appear in a new tab:
    • RESULTS panel: You can save the whole result as a CSV, JSON, or Excel file to a local path, or just select multiple lines.
    • MESSAGES panel: When you select a line number, it jumps to the first line of the running script.

Submit interactive PySpark queries

You can run PySpark interactively in the following ways:

Using the PySpark interactive command in a PY file

To use the PySpark interactive command to submit queries, follow these steps:

  1. Reopen the HDexample folder that was discussed earlier, if closed.
  2. Create a new HelloWorld.py file, following the earlier steps.
  3. Copy and paste the following code into the script file:

Python

from operator import add
lines = spark.read.text("/HdiSamples/HdiSamples/FoodInspectionData/README").rdd.map(lambda r: r[0])
counters = lines.flatMap(lambda x: x.split(' ')) \
             .map(lambda x: (x, 1)) \
             .reduceByKey(add)
 
coll = counters.collect()
sortedCollection = sorted(coll, key = lambda r: r[1], reverse = True)
 
for i in range(0, 5):
     print(sortedCollection[i])
  4. The prompt to install the PySpark/Synapse PySpark kernel is displayed in the lower right corner of the window. Click the Install button to proceed with the PySpark/Synapse PySpark installation, or click the Skip button to skip this step.
  5. If you need to install it later, navigate to File > Preference > Settings, and then clear HDInsight: Enable Skip Pyspark Installation in the settings.
  6. If the installation in step 4 is successful, the “PySpark installed successfully” message box is displayed in the lower right corner of the window. Click the Reload button to reload the window.
  7. From the menu bar, navigate to View > Command Palette… or use the Shift+Ctrl+P keyboard shortcut, and enter Python: Select Interpreter to start Jupyter Server.
  8. Select the Python option.
  9. From the menu bar, navigate to View > Command Palette… or use the Shift+Ctrl+P keyboard shortcut, and enter Developer: Reload Window.
  10. Connect to your Azure account, or link a cluster if you haven’t yet done so.
  11. Select all the code, right-click the script editor, and select Spark: PySpark Interactive / Synapse: Pyspark Interactive to submit the query.
  12. Select the cluster, if you haven’t specified a default cluster. After a few moments, the Python Interactive results appear in a new tab. Click PySpark to switch the kernel to PySpark / Synapse PySpark, and the code will run successfully. If you want to switch to the Synapse PySpark kernel, disabling auto-settings in the Azure portal is encouraged; otherwise it may take a long time to wake up the cluster and set the Synapse kernel for first-time use. The tools also let you submit a block of code instead of the whole script file by using the context menu.
  13. Enter %%info, and then press Shift+Enter to view the job information (optional).

The tools also support Spark SQL queries.

Perform interactive query in PY file using a #%% comment

  1. Add #%% before the Python code to get the notebook experience.
  2. Click Run Cell. After a few moments, the Python Interactive results appear in a new tab. Click PySpark to switch the kernel to PySpark/Synapse PySpark, then click Run Cell again, and the code will run successfully.

Leverage IPYNB support from Python extension

  1. You can create a Jupyter Notebook by command from the Command Palette or by creating a new .ipynb file in your workspace. For more information, see Working with Jupyter Notebooks in Visual Studio Code.
  2. Click the Run Cell button and follow the prompts to set the default Spark pool (setting the default cluster/pool every time before opening a notebook is strongly encouraged), and then reload the window.
  3. Click PySpark to switch the kernel to PySpark / Synapse PySpark, and then click Run Cell; after a while, the result is displayed.

Submit PySpark batch job

  1. Reopen the HDexample folder that was discussed earlier, if closed.
  2. Create a new BatchFile.py file by following the earlier steps.
  3. Copy and paste the following code into the script file:

Python

from __future__ import print_function
import sys
from operator import add
from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()
 
    lines = spark.read.text('/HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv').rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' '))\
               .map(lambda x: (x, 1))\
                .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
    spark.stop()
  4. Connect to your Azure account, or link a cluster if you haven’t yet done so.
  5. Right-click the script editor, and then select Spark: PySpark Batch or Synapse: PySpark Batch.
  6. Select a cluster/Spark pool to submit your PySpark job to.

After you submit a Python job, submission logs appear in the OUTPUT window in Visual Studio Code. The Spark UI URL and Yarn UI URL are also shown. If you submit the batch job to an Apache Spark pool, the Spark history UI URL and the Spark Job Application UI URL are also shown. You can open the URL in a web browser to track the job status.

Integrate with HDInsight Identity Broker (HIB)

Connect to your HDInsight ESP cluster with ID Broker (HIB)

You can follow the normal steps to sign in to Azure subscription to connect to your HDInsight ESP cluster with ID Broker (HIB). After sign-in, you’ll see the cluster list in Azure Explorer. For more instructions, see Connect to your HDInsight cluster.

Run a Hive/PySpark job on an HDInsight ESP cluster with ID Broker (HIB)

To run a Hive job, follow the normal steps to submit a job to the HDInsight ESP cluster with ID Broker (HIB). Refer to Submit interactive Hive queries and Hive batch scripts for more instructions.

To run an interactive PySpark job, follow the normal steps to submit a job to the HDInsight ESP cluster with ID Broker (HIB). Refer to Submit interactive PySpark queries for more instructions.

To run a PySpark batch job, follow the normal steps to submit a job to the HDInsight ESP cluster with ID Broker (HIB). Refer to Submit PySpark batch job for more instructions.

Apache Livy configuration

Apache Livy configuration is supported. You can configure it in the .VSCode\settings.json file in the workspace folder. Currently, Livy configuration only supports Python script. For more information, see Livy README.

How to trigger Livy configuration

Method 1

  1. From the menu bar, navigate to File > Preferences > Settings.
  2. In the Search settings box, enter HDInsight Job Submission: Livy Conf.
  3. Select Edit in settings.json for the relevant search result.

Method 2

Submit a file, and notice that the .vscode folder is automatically added to the work folder. You can see the Livy configuration by selecting .vscode\settings.json.


Integrate with Azure HDInsight from Explorer

You can preview Hive Table in your clusters directly through the Azure HDInsight explorer:

  1. Connect to your Azure account if you haven’t yet done so.
  2. Select the Azure icon from leftmost column.
  3. From the left pane, expand AZURE: HDINSIGHT. The available subscriptions and clusters are listed.
  4. Expand the cluster to view the Hive metadata database and table schema.
  5. Right-click the Hive table, for example hivesampletable, and select Preview.
  6. The Preview Results window opens:

  • RESULTS panel: You can save the whole result as a CSV, JSON, or Excel file to a local path, or just select multiple lines.
  • MESSAGES panel:
    • When the number of rows in the table is greater than 100, you see the following message: “The first 100 rows are displayed for Hive table.”
    • When the number of rows in the table is less than or equal to 100, you see a message such as: “60 rows are displayed for Hive table.”
    • When there’s no content in the table, you see the following message: “0 rows are displayed for Hive table.”

On Linux, install xclip to enable copying table data.

Additional features

Spark & Hive for Visual Studio Code also supports the following features:

  • IntelliSense autocomplete. Suggestions pop up for keywords, methods, variables, and other programming elements. Different icons represent different types of objects:
  • IntelliSense error marker. The language service underlines editing errors in the Hive script.
  • Syntax highlights. The language service uses different colors to differentiate variables, keywords, data type, functions, and other programming elements:

Reader-only role

Users who are assigned the reader-only role for the cluster can’t submit jobs to the HDInsight cluster, nor view the Hive database. Contact the cluster administrator to upgrade your role to HDInsight Cluster Operator in the Azure portal. If you have valid Ambari credentials, you can manually link the cluster by using the following guidance.

Browse the HDInsight cluster

When you select the Azure HDInsight explorer to expand an HDInsight cluster, you’re prompted to link the cluster if you have the reader-only role for the cluster. Use the following method to link to the cluster by using your Ambari credentials.

Submit the job to the HDInsight cluster

When submitting a job to an HDInsight cluster, you’re prompted to link the cluster if you’re in the reader-only role for the cluster. Use the following steps to link to the cluster by using Ambari credentials.

Link to the cluster

  1. Enter a valid Ambari username.
  2. Enter a valid password.

You can use Spark / Hive: List Cluster to check the linked cluster:


HPE Ezmeral Data Fabric (formerly MapR Data Platform)

MapR, the company, was commonly seen as the third horse – or should that be elephant? – in a race with Cloudera and Hortonworks. The latter two have merged, while MapR has, essentially, gone out of business.

It was announced on 5 August 2019 that Hewlett Packard Enterprise (HPE) has acquired all MapR assets for an undisclosed sum.

MapR technology allows Hadoop to perform to its potential with minimal effort. Its linchpin, the MapR filesystem, implements the HDFS API, is fully read/write, and can store trillions of files.

MapR has done more than any other vendor to deliver a reliable and efficient distribution for very large cluster deployments.

HPE Ezmeral Data Fabric XD Distributed File and Object Store

XD Cloud-Scale Data Store provides an exabyte-scale data store for building intelligent applications with the Data-fabric Converged Data Platform. XD includes all the functionality you need to manage large amounts of conventional data.

Why XD?

XD can be installed on SSD- and HDD-based servers. It includes the filesystem for data storage, data management, and data protection; support for mounting and accessing clusters using NFS and the FUSE-based POSIX (basic, platinum, or PACC) clients; and support for accessing and managing data using HDFS APIs. The cluster can be managed using the Control System and monitored using Data-fabric Monitoring (the Spyglass initiative). XD is the only cloud-scale data store that enables you to build a fabric of exabyte scale. XD supports trillions of files and hundreds of thousands of client nodes, and it can run on edge clusters, in on-premises data centers, and in the public cloud.

Accessing filesystem with C Applications

MapR provides a modified version of libhdfs that supports access to the MapR filesystem. You can develop applications with C that read files, write to files, change file permissions and file ownership, create and delete files and directories, rename files, and change the access and modification times of files and directories.

libMapRClient supports, and makes modifications to, the hadoop-2.x version of libhdfs. The API reference notes which APIs are supported by hadoop-2.x.

libMapRClient’s version of libhdfs contains the following changes and additions:

  • There are no calls to a JVM, so applications run faster and more efficiently.
  • Changes to APIs
    • hadoop-2.x: Support for hdfsBuilder structures for connections to HDFS is limited. Some of the parameters are ignored.
    • hadoop-2.x: hdfsGetDefaultBlockSize(): If the filesystem that the client is connected to is an instance of the MapR filesystem, the returned value is 256 MB, regardless of the actual setting.
    • hadoop-2.x: hdfsCreateDirectory(): The parameters for buffer size, replication, and block size are ignored for connections to MapR filesystem.
    • hadoop-2.x: hdfsGetDefaultBlockSizeAtPath(): If the filesystem that the client is connected to is an instance of the MapR filesystem, the returned value is 256 MB, regardless of the actual setting.
    • hadoop-2.x: hdfsOpenFile(): The parameters for buffer size and replication are ignored for connections to MapR filesystem.
  • APIs that are unique to libMapRClient for hadoop-2.x
    • hdfsCreateDirectory2()
    • hdfsGetNameContainerSizeBytes()
    • hdfsOpenFile2()
    • hdfsSetRpcTimeout()
    • hdfsSetThreads()

Compiling and Running a Java Application

You can compile and run the Java application using JAR files from the MapR Maven repository or from the MapR installation.

Using JARs from the MapR Maven Repository

MapR Development publishes Maven artifacts from version 2.1.2 onward at https://repository.mapr.com/maven/. When compiling for MapR 6.1, add the following dependency to the pom.xml file for your project:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.7.0-mapr-1808</version>
</dependency>

This dependency will pull the rest of the dependencies from the MapR Maven repository the next time you do a mvn clean install. The JAR that includes the maprfs library is a dependency of the hadoop-common artifact.

For a complete list of MapR-provided artifacts and further details, see Maven Artifacts for the HPE Ezmeral Data Fabric.

Using JARs from the MapR Installation

The maprfs library is included in the hadoop classpath. Add the hadoop classpath to the Java classpath when you compile and run the Java application.

  • To compile the sample code, use the following command:
javac -cp $(hadoop classpath) MapRTest.java
  • To run the sample code, use the following command:
java -cp .:$(hadoop classpath) MapRTest /test
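
The MapRTest sample itself is not reproduced in this document. As a rough sketch of what such a program might look like, the following uses only the standard Hadoop FileSystem API exposed through the maprfs library on the hadoop classpath; the class name and the /test directory argument come from the commands above, while the file name written inside that directory is a hypothetical choice for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a small file under the directory passed on the
// command line (for example, /test) and then list the directory contents.
public class MapRTest {
  public static void main(String[] args) throws Exception {
    if (args.length < 1) {
      System.err.println("Usage: MapRTest <directory>");
      System.exit(1);
    }
    Configuration conf = new Configuration();      // picks up cluster settings from the hadoop classpath
    Path dir = new Path(args[0]);
    FileSystem fs = dir.getFileSystem(conf);       // resolves to the MapR filesystem when maprfs is configured
    fs.mkdirs(dir);

    Path file = new Path(dir, "maprtest.txt");     // hypothetical file name, for illustration only
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello from MapRTest");
    }

    for (FileStatus status : fs.listStatus(dir)) { // show what was written
      System.out.println(status.getPath() + " " + status.getLen() + " bytes");
    }
  }
}

Because the program goes through the Hadoop FileSystem abstraction rather than calling the MapR client directly, the same code runs against HDFS or the MapR filesystem, depending on the configuration found on the hadoop classpath.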

Copy Data Using the hdfs:// Protocol

Describes the procedure to copy data from an HDFS cluster to a data-fabric cluster using the hdfs:// protocol.

Before you can copy data from an HDFS cluster to a data-fabric cluster using the hdfs:// protocol, you must configure the data-fabric cluster to access the HDFS cluster. To do this, complete the steps listed in Configuring a MapR Cluster to Access an HDFS Cluster for the security scenario that best describes your HDFS and data-fabric clusters and then complete the steps listed under Verifying Access to an HDFS Cluster.

You also need the following information:

  • <NameNode> – the IP address or hostname of the NameNode in the HDFS cluster
  • <NameNode Port> – the port for connecting to the NameNode in the HDFS cluster
  • <HDFS path> – the path to the HDFS directory from which you plan to copy data
  • <MapRFilesystem path> – the path in the data-fabric cluster to which you plan to copy HDFS data
  • <file> – a file in the HDFS path

To copy data from HDFS to data-fabric filesystem using the hdfs:// protocol, complete the following steps:

  1. Run the following hadoop command to determine if the data-fabric cluster can read the contents of a file in a specified directory on the HDFS cluster:
hadoop fs -cat hdfs://<NameNode>:<NameNode Port>/<HDFS path>/<file>

Example

hadoop fs -cat hdfs://nn1:8020/user/sara/contents.xml
  2. If the data-fabric cluster can read the contents of the file, run the distcp command to copy the data from the HDFS cluster to the data-fabric cluster:
hadoop distcp hdfs://<NameNode>:<NameNode Port>/<HDFS path> maprfs://<MapRFilesystem path>

Example

hadoop distcp hdfs://nn1:8020/user/sara maprfs:///user/sara

Copying Data Using NFS for the HPE Ezmeral Data Fabric

Describes how to copy files from an HDFS cluster to a data-fabric cluster using NFS for the HPE Ezmeral Data Fabric.

If NFS for the HPE Ezmeral Data Fabric is installed on the data-fabric cluster, you can mount the data-fabric cluster to the HDFS cluster and then copy files from one cluster to the other using hadoop distcp. If you do not have NFS for the HPE Ezmeral Data Fabric installed and a mount point configured, see Accessing Data with NFS v3 and Managing the Data Fabric NFS Service.

To perform a copy using distcp via NFS for the HPE Ezmeral Data Fabric, you need the following information:

  • <MapR NFS Server> – the IP address or hostname of the NFS server in the data-fabric cluster
  • <maprfs_nfs_mount> – the NFS export mount point configured on the data-fabric cluster; default is /mapr
  • <hdfs_nfs_mount> – the NFS for the HPE Ezmeral Data Fabric mount point configured on the HDFS cluster
  • <NameNode> – the IP address or hostname of the NameNode in the HDFS cluster
  • <NameNode Port> – the port on the NameNode in the HDFS cluster
  • <HDFS path> – the path to the HDFS directory from which you plan to copy data
  • <MapR filesystem path> – the path in the data-fabric cluster to which you plan to copy HDFS data

To copy data from HDFS to the data-fabric filesystem using NFS for the HPE Ezmeral Data Fabric, complete the following steps:

  1. Mount the data-fabric cluster.

Issue the following command on the HDFS cluster to mount the data-fabric cluster at the NFS for the HPE Ezmeral Data Fabric mount point:

mount <MapR NFS Server>:/<maprfs_nfs_mount> /<hdfs_nfs_mount>

Example

mount 10.10.100.175:/mapr /hdfsmount
  2. Copy data.
    1. Issue the following command to copy data from the HDFS cluster to the data-fabric cluster:
hadoop distcp hdfs://<NameNode>:<NameNode Port>/<HDFS path> file:///<hdfs_nfs_mount>/<MapR filesystem path>

Example

hadoop distcp hdfs://nn1:8020/user/sara/file.txt file:///hdfsmount/user/sara
    2. Issue the following command from the data-fabric cluster to verify that the file was copied to the data-fabric cluster:
hadoop fs -ls /<MapR filesystem path>

Example

hadoop fs -ls /user/sara

HPE Ezmeral Data Fabric Database

HPE Ezmeral Data Fabric Database is an enterprise-grade, high-performance, NoSQL database management system. You can use it for real-time, operational analytics.

Why HPE Ezmeral Data Fabric Database?

HPE Ezmeral Data Fabric Database is built into the data-fabric platform: it requires no separate processes to manage, leverages the same architecture as the rest of the platform, and adds minimal administrative overhead.

How Do I Get Started?

Based on your role, review the corresponding areas of the HPE Ezmeral Data Fabric Database documentation, which identify useful resources for each role.
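
For orientation, the following is a minimal sketch of how a Java client might insert and read back a JSON document, assuming the OJAI API that HPE Ezmeral Data Fabric Database exposes for JSON tables; the connection URL format, table path, and field names below are illustrative assumptions rather than values taken from this document:

import org.ojai.Document;
import org.ojai.store.Connection;
import org.ojai.store.DocumentStore;
import org.ojai.store.DriverManager;

// Minimal sketch: upsert one JSON document into a table and read it back by _id.
// The table path /apps/user_profiles and the document fields are hypothetical.
public class DataFabricDbSketch {
  public static void main(String[] args) {
    Connection connection = DriverManager.getConnection("ojai:mapr:");
    DocumentStore store = connection.getStore("/apps/user_profiles");

    Document doc = connection.newDocument(
        "{\"_id\": \"user0001\", \"name\": \"Sara\", \"visits\": 3}");
    store.insertOrReplace(doc);                      // upsert keyed on the _id field

    Document readBack = store.findById("user0001");  // point lookup by _id
    System.out.println(readBack.asJsonString());

    store.close();
    connection.close();
  }
}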


IBM InfoSphere Insights

IBM integrates a wealth of key data management components and analytics assets into its open-source Hadoop distribution. The company has also launched an ambitious open-source project, Apache SystemML, for machine learning.

With IBM BigInsights, customers can bring applications that integrate advanced Big Data analytics to market quickly.

Hadoop vendors continue to evolve as Big Data technologies see wider adoption and vendor revenues grow. However, these vendors face stiff competition in the Big Data market, and with so many players it can be difficult for an organization to select the tool best suited to its needs.

IBM InfoSphere BigInsights

InfoSphere® BigInsights™ is a software platform for discovering, analyzing, and visualizing data from disparate sources. You use this software to help process and analyze the volume, variety, and velocity of data that continually enters your organization every day.

InfoSphere BigInsights helps your organization to understand and analyze massive volumes of unstructured information as easily as smaller volumes of information. The flexible platform is built on an Apache Hadoop open source framework that runs in parallel on commonly available, low-cost hardware. You can easily scale the platform to analyze hundreds of terabytes, petabytes, or more of raw data that is derived from various sources. As information grows, you add more hardware to support the influx of data.

InfoSphere BigInsights helps application developers, data scientists, and administrators in your organization quickly build and deploy custom analytics to capture insight from data. This data is often integrated into existing databases, data warehouses, and business intelligence infrastructure. By using InfoSphere BigInsights, users can extract new insights from this data to enhance knowledge of your business.

InfoSphere BigInsights incorporates tooling for numerous users, speeding time to value and simplifying development and maintenance:

  • Software developers can use the Eclipse-based plug-in to develop custom text analytic functions to analyze loosely structured or largely unstructured text data.
  • Administrators can use the web-based management console to inspect the status of the software environment, review log records, assess the overall health of the system, and more.
  • Data scientists and business analysts can use the data analysis tool to explore and work with unstructured data in a familiar spreadsheet-like environment.

Analytic Applications

InfoSphere® BigInsights™ provides distinct capabilities for discovering and analyzing business insights that are hidden in large volumes of data. These technologies and features combine to help your organization manage data from the moment that it enters your enterprise.

By combining these technologies, InfoSphere BigInsights extends the Hadoop open source framework with enterprise-grade security, governance, availability, integration into existing data stores, tools that simplify developer productivity, and more.

Hadoop is a computing environment built on top of a distributed, clustered file system that is designed specifically for large-scale data operations. Hadoop is designed to scan through large data sets to produce its results through a highly scalable, distributed batch processing system. Hadoop comprises two main components: a file system, known as the Hadoop Distributed File System (HDFS), and a programming paradigm, known as Hadoop MapReduce. To develop applications for Hadoop and interact with HDFS, you use additional technologies and programming languages such as Pig, Hive, Jaql, Flume, and many others.

Apache Hadoop helps enterprises harness data that was previously difficult to manage and analyze. InfoSphere BigInsights features Hadoop and its related technologies as a core component.

MapReduce

MapReduce applications can process large data sets in parallel by using a large number of computers, known as clusters.

In this programming paradigm, applications are divided into self-contained units of work. Each of these units of work can be run on any node in the cluster. In a Hadoop cluster, a MapReduce program is known as a job. A job is run by being broken down into pieces, known as tasks. These tasks are scheduled to run on the nodes in the cluster where the data exists.

Applications submit jobs to a specific node in a Hadoop cluster, which is running a program known as the JobTracker. The JobTracker program communicates with the NameNode to determine where all of the data required for the job exists across the cluster. The job is then broken into map tasks and reduce tasks for each node in the cluster to work on. The JobTracker program attempts to schedule tasks on the cluster where the data is stored, rather than sending data across the network to complete a task. The MapReduce framework and the Hadoop Distributed File System (HDFS) typically exist on the same set of nodes, which enables the JobTracker program to schedule tasks on nodes where the data is stored.

As the name MapReduce implies, the reduce task is always completed after the map task. A MapReduce job splits the input data set into independent chunks that are processed by map tasks, which run in parallel. The map tasks produce intermediate key/value pairs, known as tuples. The reduce task takes the output from the map task as input and combines the tuples into a smaller set of tuples.

A set of programs that run continuously, known as TaskTracker agents, monitor the status of each task. If a task fails to complete, the status of that failure is reported to the JobTracker program, which reschedules the task on another node in the cluster.

This distribution of work enables map tasks and reduce tasks to run on smaller subsets of larger data sets, which ultimately provides maximum scalability. The MapReduce framework also maximizes parallelism by manipulating data stored across the nodes of the cluster. MapReduce applications do not have to be written in Java™, though most MapReduce programs that run natively under Hadoop are written in Java.

Hadoop MapReduce runs on the JobTracker and TaskTracker framework from Hadoop version 1.1.1, and is integrated with the new Common and HDFS features from Hadoop version 2.2.0. The MapReduce APIs used in InfoSphere® BigInsights™ are compatible with Hadoop version 2.2.0.
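
To make the split between map and reduce tasks concrete, here is a minimal word-count job written against the standard Hadoop MapReduce Java API; the class names and the input and output paths are illustrative and are not taken from InfoSphere BigInsights itself:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count job: map tasks emit (word, 1) tuples in parallel,
// and reduce tasks sum the counts for each word.
public class WordCount {

  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);                 // emit an intermediate (word, 1) tuple
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) {
        sum += count.get();
      }
      context.write(key, new IntWritable(sum));     // one (word, total) pair per key
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);         // combine map output locally before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework groups the intermediate tuples by key between the two phases, so each reduce call sees every count emitted for a single word, mirroring the JobTracker and TaskTracker flow described above.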

InfoSphere BigInsights supports various scenarios

Predictive modeling

Predictive modeling refers to uncovering patterns to help make business decisions, such as forecasting a propensity for fraud, or determining how pricing affects holiday candy sales online. Finding patterns has traditionally been at the core of many businesses, but InfoSphere® BigInsights™ provides new methods for developing predictive models.

Banking: Fraud reduction

A major banking institution developed fraud models to help mitigate risk of credit card fraud. However, the models took up to 20 days to develop by using traditional methods. By understanding fraud patterns nearly one month after an incident occurred, the bank was only partially able to detect fraud.

The bank used InfoSphere BigInsights to create models to help uncover patterns of events within a customer’s life that correlate to fraud. Events such as divorce, home foreclosure, and job loss could be tracked and incorporated into fraud models. These additional insights helped the bank to develop more accurate models in a fraction of the time. More robust and current fraud models helped the bank to detect existing fraud patterns and stop additional fraud before incurring losses.

Healthcare: Improved patient care

A healthcare insurance provider completed analysis on more than 400 million insurance claims to determine the potential dangers of interacting drugs. The IT organization developed a system to analyze large sets of patient data against a database of drugs, including their interactions with other drugs. Because the patient reference and treatment data is complex and intricately nested, the analysis could take more than 100 hours per data set.

The insurance provider used InfoSphere BigInsights to develop a solution that reduced analysis time from 100 hours to 10 hours. By cross-referencing each patient's list of prescriptions with an external service for known interaction problems, the service can flag potential drug-to-drug, drug-to-disease, and drug-to-allergy interactions. The insurance provider can make higher-quality recommendations for each patient and lower the overall cost of care.

Retail: Targeted marketing

A large retailer wanted to better understand customer behavior in physical stores versus behavior in the online marketplace to improve marketing in both spaces. Gaining insight required exploring massive amounts of web logs and data from physical stores. All of this information was available in the data warehouse, but the retailer could not efficiently map the disparate data.

By using InfoSphere BigInsights, the retailer parsed varying formats of web log files and mapped that information in the data warehouse, linking buying behavior online with behavior in physical stores. Predictive analytics helped to distinguish patterns across these dimensions. BigSheets was used to visualize and interact with the results to determine which elements to track. The data from these elements was cleansed in extract-transform-load (ETL) jobs and loaded into a data warehouse, enabling the retailer to act on information about buying habits. The retailer used the results to develop more targeted marketing, which led to increased sales.

Consumer sentiment insight

Developing consumer insight involves uncovering consumer sentiments for brand, campaign, and promotions management. InfoSphere® BigInsights™ helps to derive customer sentiment from social media messages, product forums, online reviews, blogs, and other customer-driven content.

Retail: Consumer goods

A large company specializing in drinks spent a substantial portion of its budget to market its brands. The company had competitive products in the soft drink, bottled water, and sports drink markets. To better understand customer sentiment and brand perception, the company wanted to track information in online social media forums. Positive and negative commentary, discussion around products, and perception of spokespersons needed to be gauged to help the company better target its marketing campaigns and promotions.

The company used a third-party tool to provide a view of this information, but concerns arose about the validity of the results. The tool could not analyze large volumes of information, so the overall view was only a portion of the data that could be incorporated. In addition, the software could not understand inherent meanings within text, or focus on relevant topics without stumbling over literal translations of misspellings, syntax, or jargon.

By using InfoSphere BigInsights, the company aggregated massive amounts of information from social media sites such as blogs, message boards, forums, and news feeds. The company used the InfoSphere BigInsights text analytics capabilities to sift through the collected information and find relevant discussions about their products. The analysis determined which discussions were favorable or not favorable, including whether the conversation was about one of the company spokespersons, or about another person with the same name. Having the right granularity and relevance from a large sample of data helped the company to obtain useful insights that were used to improve customer sentiment and enhance brand perception across their line of products.

Research and business development

Research and development involves designing and developing products that adapt to competitor offerings, customer demand, and innovation. InfoSphere® BigInsights™ helps to collect and analyze information that is critical to implementing future projects, and turning project visions into reality.

Government: Data archiving

A government library wanted to preserve the digital culture of the nation as related websites are published, modified, or removed daily. The information would be incorporated into a web archive portal that researchers and historians could use to explore preserved web content. Research analysts had classified several thousand websites with the country's domain extension by using manual methods and customized tools, but manually archiving the entire web domain, comprising more than 4 million websites, was estimated to be prohibitively costly.

The government library used InfoSphere BigInsights and a customized classification module to electronically classify and tag web content and create visualizations across numerous commodity computers in parallel. The solution drastically reduces the cost of archiving websites and helps to archive and preserve massive numbers of websites. As websites are added or modified, the content is automatically updated and archived so that researchers and historians can explore and generate new data insights.

Financial services: Acquisition management

A major credit card company wanted to automate the process of measuring the value of potential acquisitions. This comprehensive analysis needed to include public and private information, including patents and trademarks, company press releases, annual and quarterly reports, corporate genealogies, and IP ownership and patents ranked by citation. Because mergers and acquisitions take considerable time to complete, the company needed ongoing tracking capabilities. In addition, the solution had to include the ability to present information in a graphical format that business professionals can comprehend easily.

The company used InfoSphere BigInsights to improve intellectual property analysis for mergers and acquisitions by gathering information about millions of patents from the web. Millions of citations were also extracted, and then correlated with the patent records. The correlations were the basis for a ranking value measurement. The more a patent is referenced, the greater value or usefulness it has.

The analysis can be completed in hours, compared with weeks that would be required if manual methods were used. Ongoing tracking and refresh capabilities allow business professionals to review updates as they occur, providing current and comprehensive insights into potential acquisitions.