Saturday, August 9, 2014

Installing Hadoop 1.2.1 (Single Node) on Ubuntu 14.04 with openjdk-7

Procedure for installing Hadoop (Single Node) on Ubuntu 14.04 with openjdk-7

In this tutorial, I explain the steps to install and configure Hadoop 1.2.1 on Ubuntu 14.04 with openjdk-7.

Step 1: Prerequisites

1. Download Hadoop 1.2.1

Hadoop 1.2.1 can be downloaded from here. Please download a stable version of Hadoop.

2. Install openjdk-7

Hadoop needs a working Java installation. To install openjdk-7, open a new terminal and run the following commands.
  • bimal@bimal:~$ sudo apt-get update
  • bimal@bimal:~$ sudo apt-get upgrade
  • bimal@bimal:~$ sudo apt-get install openjdk-7-jdk
After installing Java, we need to add JAVA_HOME to the Ubuntu environment. For that, edit /etc/environment.
  • bimal@bimal:~$ sudo gedit /etc/environment
Then append the following line to the file.
  • JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
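The /etc/environment file is read at login, so log out and back in (or reboot) for the change to take effect. As a quick sanity check afterwards (not part of the original steps), you can verify the Java setup:
  • bimal@bimal:~$ echo $JAVA_HOME
  • bimal@bimal:~$ java -version
The first command should print /usr/lib/jvm/java-7-openjdk-amd64 and the second should report an OpenJDK 1.7 runtime.
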
3. Add a dedicated user

We need to add a dedicated user (hduser) and a hadoop group because, during the installation and configuration of Hadoop, we want to keep it separated from other users.
  • bimal@bimal:~$ sudo addgroup hadoop
  • bimal@bimal:~$ sudo adduser --ingroup hadoop hduser


4. Installing and configuring the SSH server

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it, so you need to configure an SSH server. First, install openssh-server using the following command.
  • bimal@bimal:~$ sudo apt-get install openssh-server


After installation, open a new terminal and switch to the hduser user with the following command.
  • bimal@bimal:~$ su - hduser

After switching to hduser, create an SSH key using the following command.

  • hduser@bimal:~$ ssh-keygen -t rsa -P ""

The above command creates an RSA key pair with an empty passphrase, so Hadoop can connect over SSH without prompting for a password each time.

Then we need to enable SSH access to our local machine with this newly created key. The command below appends the public key to authorized_keys.
  • hduser@bimal:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
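If the ssh localhost step below still asks for a password, the usual cause is file permissions on the key material; tightening them (an extra precaution that is sometimes needed, not covered in the original steps) generally resolves it:
  • hduser@bimal:~$ chmod 700 $HOME/.ssh
  • hduser@bimal:~$ chmod 600 $HOME/.ssh/authorized_keys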
The last and final step is to connect to our local machine as hduser. This step is also needed to save our local machine's host key fingerprint to the hduser user's known_hosts file.
  • hduser@bimal:~$ ssh localhost

If you get an error during the above step, try reinstalling openssh-server.

5. Disable IPv6

Before disabling IPv6, type exit (repeatedly, if needed) to leave the SSH session, or open a new terminal, and edit the file /etc/sysctl.conf using the following command.
  • hduser@bimal:~$ sudo gedit /etc/sysctl.conf
Add the following lines at the end of the file.
  • net.ipv6.conf.all.disable_ipv6 = 1
  • net.ipv6.conf.default.disable_ipv6 = 1
  • net.ipv6.conf.lo.disable_ipv6 = 1

You need to restart your system for the change to take effect.
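Alternatively, the settings from /etc/sysctl.conf can usually be applied without a full reboot by reloading them:
  • hduser@bimal:~$ sudo sysctl -p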

You can check whether IPv6 is disabled by using the following command.
  • hduser@bimal:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

If it returns 0, IPv6 is not disabled; if it returns 1, IPv6 is disabled.

Step 2: Install Hadoop

1. Extract and Modify Permissions

First, move the Hadoop package to /usr/local and change into that directory. Extract the package with the tar command, rename the extracted directory to hadoop, and then change the owner of the hadoop directory and of all files and directories in it to hduser. The following commands are used for this purpose.
  • sudo mv /home/bimal/Downloads/hadoop-1.2.1.tar.gz /usr/local
  • cd /usr/local
  • sudo tar xzf hadoop-1.2.1.tar.gz
  • sudo mv hadoop-1.2.1 hadoop
  • sudo chown -R hduser:hadoop hadoop
Note: If the -R option gives an error, just use the --recursive option instead.

2. Update ‘$HOME/.bashrc’ of hduser

First, open $HOME/.bashrc of hduser:
  • hduser@bimal:~$ sudo gedit /home/hduser/.bashrc
After opening the file, append the following lines to the end of it.

# Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and compress
# job outputs with LZOP (not covered in this tutorial): conveniently inspect
# an LZOP compressed file from the command line; run via:
#
#   $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_PREFIX/bin
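
The new variables take effect the next time hduser opens a shell. To apply them in the current terminal and confirm that the hadoop command is found (a quick check, not in the original steps), you can run:
  • hduser@bimal:~$ source $HOME/.bashrc
  • hduser@bimal:~$ hadoop version
The second command should report Hadoop 1.2.1.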


Step 3: Configuring Hadoop

Now we have to configure the directory where Hadoop will store its data files, the network ports it listens to, etc.

1. Assigning working directory

We will use the directory /app/hadoop/tmp. Hadoop's default configuration uses hadoop.tmp.dir as the base temporary directory, both for the local file system and for HDFS. Now we create the directory and set the required ownership and permissions.

  • hduser@bimal:~$ sudo mkdir -p /app/hadoop/tmp
  • hduser@bimal:~$ sudo chown hduser:hadoop /app/hadoop/tmp
  • hduser@bimal:~$ sudo chmod 750 /app/hadoop/tmp
If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node in the next section.
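To double-check the directory before moving on (an optional verification, not in the original write-up):
  • hduser@bimal:~$ ls -ld /app/hadoop/tmp
The listing should show hduser as the owner, hadoop as the group, and drwxr-x--- permissions.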

2. Configuring Hadoop setup files

I. hadoop-env.sh

The only required environment variable we have to configure for Hadoop itself is JAVA_HOME. Open the file /usr/local/hadoop/conf/hadoop-env.sh and replace
# The java implementation to use. Required.
#export JAVA_HOME=/usr/lib/jvm/
With
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64


II. core-site.xml


Open the file core-site.xml and add the following lines between <configuration>...</configuration>.

  • hduser@bimal:~$ sudo gedit /usr/local/hadoop/conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
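
For reference, after this edit the whole file should look roughly like the sketch below; the two header lines are the ones Hadoop's stock configuration files normally start with, and the property descriptions are omitted here for brevity, so keep whatever your copy already contains:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>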

III. mapred-site.xml

Open the file /usr/local/hadoop/conf/mapred-site.xml and append the following code between <configuration>...</configuration>.
  • hduser@bimal:~$ sudo gedit /usr/local/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

IV. hdfs-site.xml

Open up the file /usr/local/hadoop/conf/hdfs-site.xml
  • hduser@bimal:~$ sudo gedit /usr/local/hadoop/conf/hdfs-site.xml
Add the following lines between <configuration>...</configuration>.

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>


Step 4: Formatting the HDFS Filesystem via the Namenode


Open a new terminal and switch to hduser, then format the HDFS filesystem.
  • bimal@bimal:~$ su - hduser
  • hduser@bimal:~$ /usr/local/hadoop/bin/hadoop namenode -format
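Formatting writes a fresh HDFS metadata directory under the hadoop.tmp.dir we configured earlier. Only run it once during initial setup, because reformatting erases all data stored in HDFS. As an optional check (the path below follows from Hadoop 1.x keeping the name node data in ${hadoop.tmp.dir}/dfs/name by default), you can list the newly created directory:
  • hduser@bimal:~$ ls /app/hadoop/tmp/dfs/name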
Step 5: Starting your Single-Node Cluster

  • hduser@bimal:~$ /usr/local/hadoop/bin/start-all.sh
After the above command completes, run the command below to check which Hadoop daemons have started.
  • hduser@bimal:/usr/local/hadoop$ jps
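If everything started correctly, the jps listing should show the five Hadoop 1.x daemons alongside Jps itself, similar to the illustrative output below (the process IDs will differ on your machine):
2054 NameNode
2266 DataNode
2478 SecondaryNameNode
2566 JobTracker
2778 TaskTracker
2893 Jps
If one of the daemons is missing, check the log files under /usr/local/hadoop/logs for the reason.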


You can also check if Hadoop is listening on the configured ports. Open a new terminal and run
  • sudo netstat -plten | grep java
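Hadoop 1.x also exposes web interfaces for the daemons on their default ports; as an additional check you can open the NameNode UI at http://localhost:50070 and the JobTracker UI at http://localhost:50030 in a browser.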


Step 6: Stopping your Single-Node Cluster

To stop all the daemons running on your machine, run the following command.
  • hduser@bimal:~$ /usr/local/hadoop/bin/stop-all.sh