Apache Hadoop — How to install and configure a cluster on Ubuntu 18.04
Hi people!!
In this tutorial, we will install and configure Hadoop on Ubuntu in a single-node setup.
Before we start, we need the following requirements:
- Ubuntu 18.04 (a virtual machine in VirtualBox works fine);
- Java installed;
After installing Ubuntu, make sure Java is installed by running the following command:
java -version
If not, use the following commands to install it:
sudo apt-get install default-jre
sudo apt-get install default-jdk
Now you have Java installed.
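If you later need the path to your JDK (we will, when setting JAVA_HOME), you can locate it like this; the exact path varies with the JDK version installed:
readlink -f /usr/bin/java | sed "s:/bin/java::"
# Prints something like /usr/lib/jvm/java-11-openjdk-amd64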
1st step: Hadoop user configuration
To begin, let’s create a new group called hadoop:
sudo addgroup hadoop
Then we will add a user named hadoopuser to that group and give it sudo privileges:
sudo adduser --ingroup hadoop hadoopuser
sudo adduser hadoopuser sudo
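To double-check that the user exists and belongs to the right groups, you can run the following (the output shown assumes the two commands above succeeded):
groups hadoopuser
# Should report: hadoopuser : hadoop sudo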
2nd step: Install and configure OpenSSH
We will install OpenSSH using the following command:
sudo apt-get install openssh-server
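Before going any further, you can confirm the SSH server is up (ssh is the service name on Ubuntu 18.04):
sudo systemctl status ssh
# Look for "active (running)" in the output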
Hadoop uses SSH to access its nodes. Since we are configuring a single node, SSH only needs to reach localhost.
We will log in as the user we created:
su - hadoopuser
Next, we will generate an SSH key pair for hadoopuser, with an empty passphrase:
ssh-keygen -t rsa -P ""
Let’s add the key we generated earlier to the list of authorized_keys
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
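If SSH later refuses the key, it is usually a permissions problem; tightening the file permissions is a safe precaution:
chmod 0600 $HOME/.ssh/authorized_keys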
In order to make sure SSH is working, we will use the following command:
ssh localhost
Finally, use the command “exit” to close the connection.
3rd step: Install and configure Hadoop
We will now download Hadoop version 2.9.1:
wget -P ~/Desktop/ https://archive.apache.org/dist/hadoop/common/hadoop-2.9.1/hadoop-2.9.1.tar.gz
Let’s extract the Hadoop archive:
cd ~/Desktop
tar xvzf hadoop-2.9.1.tar.gz
After extracting, we will move it to the following directory:
sudo mv hadoop-2.9.1 /usr/local/hadoop
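A quick listing confirms the move worked (the layout below reflects the 2.9.1 tarball):
ls /usr/local/hadoop
# Should show bin, etc, include, lib, libexec, sbin, share, among others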
We will assign ownership of the ‘hadoop’ folder to the hadoopuser user:
sudo chown -R hadoopuser /usr/local/hadoop
We then move on to the configuration of several files.
To start the configuration, we will open the .bashrc file (no sudo needed, since it belongs to the current user):
gedit ~/.bashrc
Copy the following settings to the end of the file:
export JAVA_HOME=/usr/lib/jvm/default-java # used by HADOOP_CLASSPATH below; adjust if your JDK lives elsewhere
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS=""
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
Save the file and use the following command:
source ~/.bashrc
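To confirm the new environment variables took effect, ask Hadoop for its version:
hadoop version
# Should report Hadoop 2.9.1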
Now let’s edit the file hadoop-env.sh, which lives in Hadoop’s configuration directory, and define the JAVA_HOME variable:
cd /usr/local/hadoop/etc/hadoop/
sudo nano hadoop-env.sh
And set the JAVA_HOME line as follows (the path assumes the default-jdk package installed earlier; adjust it to your JDK if needed):
export JAVA_HOME=/usr/lib/jvm/default-java
Now, let’s set up the single-node cluster itself.
For this we will edit the following files:
- core-site.xml
sudo nano core-site.xml
Put the following property in the file, inside the <configuration> element:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
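For reference, here is a minimal sketch of what the complete file looks like after the edit (the <configuration> element is already present in the stock file; only the property is new):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>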
- hdfs-site.xml
sudo nano hdfs-site.xml
Put the following properties in the file, again inside the <configuration> element (note that the namenode and datanode paths must match the directories we create later):
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_space/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_space/hdfs/datanode</value>
</property>
- yarn-site.xml
sudo nano yarn-site.xml
Put the following properties:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
- mapred-site.xml
This file ships as a template named mapred-site.xml.template, so we first copy it to mapred-site.xml with the following command:
sudo cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
Then, let’s open the file to edit it:
sudo nano mapred-site.xml
Place the following property:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Now, let’s create the directories for the namenode and the datanode, using the following commands:
cd
sudo mkdir -p /usr/local/hadoop_space/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_space/hdfs/datanode
Now, let’s format the namenode. First, assign ownership of the hadoop_space folder to the hadoopuser user:
sudo chown -R hadoopuser /usr/local/hadoop_space
cd
hdfs namenode -format
Next, we will start the HDFS and YARN services (start-all.sh still works in 2.9.1, but it is deprecated in favor of these two scripts):
start-dfs.sh
start-yarn.sh
To verify that all services started correctly, run the following command:
jps
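If everything started correctly, the output should list the five Hadoop daemons plus Jps itself (the process IDs below are illustrative and will differ on your machine):
2705 NameNode
2866 DataNode
3079 SecondaryNameNode
3237 ResourceManager
3565 NodeManager
3885 Jps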
To finish, open the YARN ResourceManager web UI at the following URL:
http://localhost:8088/cluster
The HDFS NameNode web UI is also available at http://localhost:50070.
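As an optional sanity check, you can create a directory in HDFS and run one of the bundled MapReduce examples (the jar path below assumes the stock 2.9.1 tarball layout):
# Create a home directory for hadoopuser in HDFS and list the root
hdfs dfs -mkdir -p /user/hadoopuser
hdfs dfs -ls /
# Run the bundled pi estimator on YARN (2 maps, 5 samples each)
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.1.jar pi 2 5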