Install Apache Hadoop on Debian 9 / Ubuntu 16.04 / CentOS 7 (Single Node Cluster)

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing. It handles very large data sets by distributing them across clusters of computers.

Rather than relying on hardware for high availability, the Hadoop modules are designed to detect and handle failures at the application layer, giving you a highly available service.

The Hadoop framework consists of the following modules:

  • Hadoop Common – a common set of libraries and utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS) – a Java-based distributed file system that stores data and provides very high throughput to applications.
  • Hadoop YARN – manages resources on compute clusters and schedules users’ applications on them.
  • Hadoop MapReduce – a framework for large-scale data processing.

This guide will help you get Apache Hadoop installed on Debian 9 / Ubuntu 16.04 / CentOS 7. It should also work on Ubuntu 14.04.

Prerequisites

Switch to the root user.

su -

OR

sudo su -

Apache Hadoop requires Java version 8 or above, so you can choose to install either OpenJDK or Oracle JDK.

Here, for this demo, I will be installing OpenJDK 8.

### Debian 9 / Ubuntu 16.04 ###

apt-get -y install openjdk-8-jdk wget

### CentOS 7 / RHEL 7 ###

yum -y install java-1.8.0-openjdk wget
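
Once the packages are installed, it is worth confirming that Java 8 is the version on your PATH before going any further:

java -version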

Create Hadoop user

It is recommended to create a regular user to configure and run Apache Hadoop. So, create a user named “hadoop” and set a password.

useradd -m -d /home/hadoop -s /bin/bash hadoop

passwd hadoop

Once you have created the user, configure passwordless SSH to the local system. Create an SSH key using the following commands.

# su - hadoop

$ ssh-keygen

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$ chmod 600 ~/.ssh/authorized_keys

Verify the passwordless communication to the local system. If you are connecting over SSH for the first time, type “yes” to add the host key to the known hosts file.

$ ssh 127.0.0.1
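
If the key-based login is working, the following check should print the hostname without asking for a password; BatchMode makes ssh fail immediately instead of falling back to a password prompt:

$ ssh -o BatchMode=yes 127.0.0.1 hostname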

Download Hadoop

You can visit the Apache Hadoop page to download the latest Hadoop package, or you can just issue the following commands in the terminal to download and extract Hadoop 2.8.1.

$ wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz

$ tar -zxvf hadoop-2.8.1.tar.gz

$ mv hadoop-2.8.1 hadoop

Install Apache Hadoop

Hadoop supports three cluster modes:

  1. Local (Standalone) Mode – Hadoop runs as a single Java process.
  2. Pseudo-Distributed Mode – Each Hadoop daemon runs in a separate Java process on a single node.
  3. Fully Distributed Mode – An actual multi-node cluster, ranging from a few nodes to an extremely large cluster.

Set up environment variables

Here, we will be configuring Hadoop in Pseudo-Distributed mode. To start with, set the environment variables in the ~/.bashrc file.

$ vi ~/.bashrc

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.144-0.b01.el7_4.x86_64/jre/ # Change this to match your Java installation directory
export HADOOP_HOME=/home/hadoop/hadoop # Change this to match your Hadoop installation directory
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Apply the environment variables to the current session.

$ source ~/.bashrc
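
To confirm the variables took effect, check that HADOOP_HOME is set and that the hadoop binary is now on the PATH:

$ echo $HADOOP_HOME
$ hadoop version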

Modify Configuration files

Edit the Hadoop environment file.

vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Set JAVA_HOME environment variable.

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.144-0.b01.el7_4.x86_64/jre/
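
The exact path differs between distributions and OpenJDK builds. If you are unsure where the JDK lives, one way to discover it (assuming java is on your PATH and managed by the alternatives system) is:

$ readlink -f $(which java) | sed 's:/bin/java::'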

Hadoop has many configuration files, and we need to edit them depending on the cluster mode we are setting up (Pseudo-Distributed).

$ cd $HADOOP_HOME/etc/hadoop

Edit core-site.xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Edit hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
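
The dfs.name.dir and dfs.data.dir values above point to directories under /home/hadoop/hadoopdata. Hadoop will create them when needed, but you can create them up front so that any typo in the paths shows up early; for example:

$ mkdir -p ~/hadoopdata/hdfs/namenode ~/hadoopdata/hdfs/datanode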

Edit mapred-site.xml

mapred-site.xml does not exist by default, so copy it from the provided template first.

$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Edit yarn-site.xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

Now format the NameNode using the following command. Do not forget to double-check the storage directory paths configured in hdfs-site.xml first.

$ hdfs namenode -format
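
If the format succeeds, the NameNode metadata directory configured in hdfs-site.xml should now contain a current/ subdirectory with a VERSION file and an initial fsimage; a quick check:

$ ls -l ~/hadoopdata/hdfs/namenode/current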

Firewall

Allow Apache Hadoop through the firewall.

FirewallD:

firewall-cmd --permanent --add-port=50070/tcp
firewall-cmd --permanent --add-port=8088/tcp
firewall-cmd --reload

UFW:

ufw allow 50070/tcp
ufw allow 8088/tcp
ufw reload

Start the NameNode and DataNode daemons by using the scripts provided by Hadoop in its sbin directory.

$ cd $HADOOP_HOME/sbin/
$ start-dfs.sh
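
You can use the jps utility that ships with the JDK to confirm the HDFS daemons are running; you should normally see NameNode, DataNode and SecondaryNameNode in the list:

$ jps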

Open your web browser and browse the NameNode at

http://your-ip-address:50070/

Install Apache Hadoop on Debian 9 – Hadoop NameNode Information (screenshot)

Start the ResourceManager and NodeManager daemons.

$ start-yarn.sh
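
Running jps again should now also show ResourceManager and NodeManager. As an extra check, you can query the ResourceManager’s standard REST API to make sure YARN is responding:

$ curl http://localhost:8088/ws/v1/cluster/info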

Browse the web interface for the ResourceManager at

http://your-ip-address:8088/

Install Apache Hadoop on Debian 9 – Yarn (screenshot)

Testing Hadoop single node cluster

Before uploading anything, let us create a directory in HDFS.

$ hdfs dfs -mkdir /raj

Let us upload a file into the HDFS directory called “raj”.

$ hdfs dfs -put ~/.bashrc /raj

Uploaded files can be viewed by visiting the following URL, or via Utilities –> Browse the file system in the NameNode web interface.

http://your-ip-address:50070/explorer.html#/raj

Install Apache Hadoop on Debian 9 – Hadoop FS (screenshot)

Copy the files from HDFS to your local file system.

$ hdfs dfs -get /raj /tmp/
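
To verify that YARN and MapReduce work end to end, you can also run one of the example jobs that ship with Hadoop against the uploaded file. The jar path below assumes the Hadoop 2.8.1 layout used in this guide, and /raj-output is just an arbitrary output directory that must not already exist:

$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount /raj /raj-output
$ hdfs dfs -cat /raj-output/part-r-00000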

You can delete the files and directories using the following commands.

hdfs dfs -rm /raj/.bashrc
hdfs dfs -rm -r -f /raj
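
When you are done experimenting, the daemons can be stopped with the companion scripts from the same sbin directory:

$ stop-yarn.sh
$ stop-dfs.sh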

That’s all. You have successfully configured a single-node Hadoop cluster.
