
Installing Hadoop

    Hadoop should be downloaded on the master server using the following procedure.

    # mkdir /opt/hadoop
    # cd /opt/hadoop/
    # wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.0/hadoop-1.2.0.tar.gz
    # tar -xzf hadoop-1.2.0.tar.gz
    # mv hadoop-1.2.0 hadoop
    # chown -R hadoop /opt/hadoop
    # cd /opt/hadoop/hadoop/

    Configuring Hadoop

    Hadoop must be configured by editing core-site.xml as shown below, changing the values wherever required.

    <configuration>
    <property>
    <name>fs.default.name</name><value>hdfs://hadoop-master:9000/</value>
    </property>
    <property>
    <name>dfs.permissions</name>
    <value>false</value>
    </property>
    </configuration>

    The hdfs-site.xml file should be edited as follows.

    <configuration>
    <property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop/hadoop/dfs/name/data</value>
    <final>true</final>
    </property>
    <property>
    <name>dfs.name.dir</name>
    <value>/opt/hadoop/hadoop/dfs/name</value>
    <final>true</final>
    </property>
    <property>
    <name>dfs.replication</name>
    <value>1</value>
    </property>
    </configuration>
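    Since dfs.name.dir and dfs.data.dir both live under /opt/hadoop/hadoop/dfs, it can help to pre-create those directories and fix their ownership before formatting the NameNode. A minimal sketch, run here against a scratch prefix rather than the real /opt/hadoop/hadoop:

```shell
# Scratch prefix standing in for /opt/hadoop/hadoop on the real master.
PREFIX=$(mktemp -d)

# Create the paths named by dfs.name.dir and dfs.data.dir in hdfs-site.xml.
mkdir -p "$PREFIX/dfs/name/data"

# On the real master you would also run: chown -R hadoop "$PREFIX/dfs"
find "$PREFIX/dfs" -type d
```

    On the real master, substitute /opt/hadoop/hadoop for the scratch prefix and keep the chown so that the hadoop user can write to both directories.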

    The mapred-site.xml file should be edited as per the requirement; an example is shown below.

    <configuration>
    <property>
    <name>mapred.job.tracker</name><value>hadoop-master:9001</value>
    </property>
    </configuration>

    Open hadoop-env.sh and edit JAVA_HOME, HADOOP_CONF_DIR, and HADOOP_OPTS as follows:

    export JAVA_HOME=/opt/jdk1.7.0_17
    export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
    export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf
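    These exports go into conf/hadoop-env.sh. A small sketch (run here against a scratch copy, so the real file is untouched) that appends them idempotently, so re-running the setup never duplicates a line:

```shell
# Scratch copy standing in for /opt/hadoop/hadoop/conf/hadoop-env.sh.
CONF=$(mktemp)

add_export() {
  # Append "export NAME=VALUE" only if NAME is not already exported in the file.
  grep -q "^export ${1}=" "$CONF" || echo "export ${1}=${2}" >> "$CONF"
}

add_export JAVA_HOME /opt/jdk1.7.0_17
add_export HADOOP_OPTS -Djava.net.preferIPv4Stack=true
add_export HADOOP_CONF_DIR /opt/hadoop/hadoop/conf
add_export JAVA_HOME /some/other/jdk   # no-op: JAVA_HOME is already set

cat "$CONF"
```

    Pointing CONF at the real hadoop-env.sh applies the same edits in place.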

    Installing Hadoop on Slave Servers

    Hadoop should be installed on all the slave servers by copying the master's installation, as follows.

    # su hadoop
    $ cd /opt/hadoop
    $ scp -r hadoop hadoop-slave-1:/opt/hadoop
    $ scp -r hadoop hadoop-slave-2:/opt/hadoop
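    With more slaves, the per-host scp lines are easier to maintain as a loop over a host list. A dry-run sketch (the commands are printed rather than executed, so no SSH access is needed to try it):

```shell
# Dry run: print the scp command for each slave instead of executing it.
SLAVES="hadoop-slave-1 hadoop-slave-2"
CMDS=$(for host in $SLAVES; do
  echo "scp -r /opt/hadoop/hadoop ${host}:/opt/hadoop"
done)
echo "$CMDS"
```

    To perform the real copy, run the scp command directly inside the loop instead of echoing it.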

    Configuring Hadoop on Master Server

    Master server configuration

    # su hadoop
    $ cd /opt/hadoop/hadoop
    Master Node Configuration
    $ vi conf/masters
    hadoop-master

    Slave Node Configuration

    $ vi conf/slaves
    hadoop-slave-1
    hadoop-slave-2
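    The two files above can also be written non-interactively, which is handy in a setup script; a sketch against a scratch directory standing in for the Hadoop configuration directory:

```shell
# Scratch directory standing in for the Hadoop configuration directory.
CONF_DIR=$(mktemp -d)

# One hostname per line, matching the masters/slaves format Hadoop expects.
printf 'hadoop-master\n' > "$CONF_DIR/masters"
printf 'hadoop-slave-1\nhadoop-slave-2\n' > "$CONF_DIR/slaves"

cat "$CONF_DIR/slaves"
```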

    Name Node format on Hadoop Master

    # su hadoop
    $ cd /opt/hadoop/hadoop
    $ bin/hadoop namenode -format
    11/10/14 10:58:07 INFO namenode.NameNode: STARTUP_MSG:
    ************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG: host = hadoop-master/192.168.1.109
    STARTUP_MSG: args = [-format]
    STARTUP_MSG: version = 1.2.0
    STARTUP_MSG: build =
    https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473; compiled by 'hortonfo' on Monday May 6 06:59:37 UTC 2013
    STARTUP_MSG: java = 1.7.0_71
    ************************************************************
    11/10/14 10:58:08 INFO util.GSet: Computing capacity for map BlocksMap editlog=/opt/hadoop/hadoop/dfs/name/current/edits
    ...
    11/10/14 10:58:08 INFO common.Storage: Storage directory /opt/hadoop/hadoop/dfs/name has been successfully formatted.
    11/10/14 10:58:08 INFO namenode.NameNode: SHUTDOWN_MSG:
    ************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.1.109
    ************************************************************

    Hadoop Services

    The following procedure explains how to start the Hadoop services on hadoop-master.

    $ cd $HADOOP_HOME/bin
    $ ./start-all.sh

    The procedure for adding a new DataNode to the Hadoop cluster is as follows:

    Networking

    New nodes should be added to an existing Hadoop cluster with a suitable network configuration. Assume the following network configuration for the new node:

    IP address : 192.168.1.103
    netmask : 255.255.255.0
    hostname : slave3.in

    Adding a User and SSH Access

    On the new node, add a “hadoop” user; the password of the hadoop user can be set to anything you want.

    useradd hadoop
    passwd hadoop

    To be executed on master

    mkdir -p $HOME/.ssh
    chmod 700 $HOME/.ssh
    ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
    chmod 644 $HOME/.ssh/authorized_keys
    Copy the public key to the new slave node, into the hadoop user's $HOME directory:
    scp $HOME/.ssh/id_rsa.pub [email protected]:/home/hadoop/

    To be executed on slaves

    su hadoop or ssh -X [email protected]

    The content of the public key must be copied into the file “$HOME/.ssh/authorized_keys”, and its permissions must then be changed as follows.

    cd $HOME
    mkdir -p $HOME/.ssh
    chmod 700 $HOME/.ssh
    cat id_rsa.pub >>$HOME/.ssh/authorized_keys
    chmod 644 $HOME/.ssh/authorized_keys
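    The slave-side sequence above can be rehearsed locally against a scratch HOME with a dummy key, which is a safe way to check the commands before running them for real on slave3 (nothing below touches the network or your real ~/.ssh):

```shell
# Scratch HOME standing in for /home/hadoop on slave3, with a dummy public key.
FAKE_HOME=$(mktemp -d)
echo 'ssh-rsa AAAAB3dummykey hadoop@hadoop-master' > "$FAKE_HOME/id_rsa.pub"

# Same steps as above, pointed at the scratch HOME.
cd "$FAKE_HOME"
mkdir -p "$FAKE_HOME/.ssh"
chmod 700 "$FAKE_HOME/.ssh"
cat id_rsa.pub >> "$FAKE_HOME/.ssh/authorized_keys"
chmod 644 "$FAKE_HOME/.ssh/authorized_keys"

cat "$FAKE_HOME/.ssh/authorized_keys"
```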

    Now switch to ssh login from the master machine, and verify that you can ssh to the new node without a password.

    ssh [email protected] or [email protected]

    Setting the Hostname for the New Node

    The hostname is set in the file /etc/sysconfig/network. On the new slave3 machine:
    NETWORKING=yes
    HOSTNAME=slave3.in

    The machine must be restarted, or the hostname command must be run on the new machine with the new hostname, for the change to take effect.

    On slave3 node machine:

    hostname slave3.in
    /etc/hosts must be updated on all machines of the cluster:

    192.168.1.103 slave3.in slave3
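    Because /etc/hosts must be edited on every machine, it is worth making the edit idempotent so the step can be re-run safely. A sketch against a scratch copy of the file (the IP and hostname are the ones assumed for slave3 earlier in this guide):

```shell
# Scratch copy standing in for /etc/hosts; repeat on every cluster machine.
HOSTS=$(mktemp)
printf '192.168.1.109 hadoop-master\n' > "$HOSTS"

ENTRY='192.168.1.103 slave3.in slave3'
# Append only when slave3 is not already listed, so re-runs change nothing.
grep -q 'slave3\.in' "$HOSTS" || echo "$ENTRY" >> "$HOSTS"
grep -q 'slave3\.in' "$HOSTS" || echo "$ENTRY" >> "$HOSTS"   # second run: no-op

cat "$HOSTS"
```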

    Ping the machine by hostname to check whether it resolves to the IP address.

    ping master.in

    Start the DataNode on New Node

    The DataNode daemon should be started manually on the new node using the $HADOOP_HOME/bin/hadoop-daemon.sh script. The DataNode will contact the NameNode (master) automatically and join the cluster. The new node should also be added to the slaves file on the master server so that the script-based commands will recognize it.

    Login to new node

    su hadoop or ssh -X [email protected]

    Start HDFS on the newly added slave node using the following command.

    ./bin/hadoop-daemon.sh start datanode

    The jps command output must be checked on the new node.

    $ jps
    7141 DataNode
    10312 Jps

    Removing a DataNode

    A node can be removed from a cluster while it is running, without any data loss. HDFS provides a decommissioning feature which ensures that removing a node is done safely.

    Step 1

    Log in to the master machine as the user account under which Hadoop is installed.

    $ su hadoop

    Step 2

    Before starting the cluster, an exclude file must be configured: a key named dfs.hosts.exclude should be added to the $HADOOP_HOME/conf/hdfs-site.xml file.

    The value associated with this key is the full path to a file on the NameNode’s local file system containing the list of machines which are not permitted to connect to HDFS, as follows.

    <property>
    <name>dfs.hosts.exclude</name><value>/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt</value><description>DFS exclude</description>
    </property>

    Step 3

    Determine the hosts to be decommissioned.

    Add one domain name per line to hdfs_exclude.txt for each machine to be decommissioned; this prevents those machines from connecting to the NameNode. To remove slave2.in, for example, the file contains:

    slave2.in
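    Maintaining the exclude file can also be scripted; a minimal sketch, using a scratch file in place of the real hdfs_exclude.txt and a hypothetical helper named decommission:

```shell
# Scratch file standing in for the dfs.hosts.exclude file on the NameNode.
EXCLUDE=$(mktemp)

decommission() {
  # Record the host, one domain name per line, skipping duplicates.
  grep -qx "$1" "$EXCLUDE" || echo "$1" >> "$EXCLUDE"
}

decommission slave2.in
decommission slave2.in   # calling it twice changes nothing

cat "$EXCLUDE"
```

    After editing the real file, the NameNode must still be told to re-read its configuration, as described in the next step.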

    Step 4

    Force a configuration reload.

    Run “$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes” on the master:

    $ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes

    This forces the NameNode to re-read its configuration, including the newly updated ‘excludes’ file. Nodes will be decommissioned over a period of time, allowing each node’s blocks to be replicated onto the machines which are scheduled to remain active. Check the jps command output on slave2.in; once the decommission completes, the DataNode process there will shut down automatically.

    Step 5

    Shut down the nodes.

    After the decommission process has completed, the decommissioned hardware can be safely shut down for maintenance. Run the report command to check the status of the decommission.

    $ $HADOOP_HOME/bin/hadoop dfsadmin -report

    Step 6

    Edit the excludes file again: once the machines have been decommissioned, they can be removed from the ‘excludes’ file. Running “$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes” again will read the excludes file back into the NameNode, allowing the DataNodes to rejoin the cluster after the maintenance has been completed, or whenever additional capacity is needed again.

    The TaskTracker can be stopped and started on the fly as follows:

    $ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
    $ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

    Add a new node with the following steps

    1) Take a new system and create a new username and password on it

    2) Install SSH and set up SSH connections with the master node

    3) Add the ssh public rsa id key to the authorized_keys file

    4) Add the new DataNode's hostname, IP address and other details in /etc/hosts and in the slaves file: 192.168.1.103 slave3.in slave3

    5) Start the DataNode on the New Node

    6) Log in to the new node with a command like su hadoop or ssh -X [email protected]

    7) Start HDFS on the newly added slave node by using the following command: ./bin/hadoop-daemon.sh start datanode

    8) Check the output of the jps command on the new node
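    The steps above can be collected into a single dry-run plan. This sketch only prints the actions (ssh-copy-id stands in here for the manual key copy described earlier; the hostnames and IP are the ones assumed in this guide):

```shell
# Dry-run plan for adding slave3: every action is printed, nothing is executed.
NODE=slave3.in
IP=192.168.1.103

PLAN=$(cat <<EOF
useradd hadoop && passwd hadoop             # step 1: user on the new system
ssh-copy-id hadoop@${NODE}                  # steps 2-3: install the ssh key
echo '${IP} ${NODE} slave3' >> /etc/hosts   # step 4: hosts entry, all machines
echo '${NODE}' >> slaves                    # step 4: slaves file on the master
ssh hadoop@${NODE} ./bin/hadoop-daemon.sh start datanode   # steps 5-7
ssh hadoop@${NODE} jps                      # step 8: verify the DataNode
EOF
)
echo "$PLAN"
```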