Apache Hadoop — How to Implement a Multi-Node Distributed Platform

Afonso Antunes
May 30, 2021 · 7 min read

In this tutorial, we will implement a distributed platform using Apache Hadoop on Ubuntu. We will use one machine as the master and two as slaves.

For this, we will have to use the following requirements:

  • Ubuntu 18.04 (it can run in a virtual machine in VirtualBox);
  • Java 8 installed (we install it in step 7).

If you use VirtualBox, before starting the virtual machine, configure the network as shown in the following figure. The adapter must let the machines reach one another; the 192.168.56.x addresses used later in this tutorial correspond to VirtualBox’s default Host-only network.

1st step: Install ssh

Open the command line and use the following command:

sudo apt install ssh

2nd step: Install pdsh

sudo apt install pdsh

3rd step: Configure .bashrc

To access this file, we will use the following command:

nano ~/.bashrc

At the end of the file, we will tell pdsh to use ssh as its remote command mechanism by adding the following line:

export PDSH_RCMD_TYPE=ssh

Save the file using CTRL + O and then click ENTER. To exit, use CTRL + X.
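To make sure the change takes effect in the current session, reload the file and print the variable as a quick check:

source ~/.bashrc
echo $PDSH_RCMD_TYPE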

4th step: Generate an ssh key

Returning to the command line, we will generate an ssh key with an empty passphrase:

ssh-keygen -t rsa -P ""

5th step: Copy the generated key

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
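If ssh still prompts for a password later, it is often a permissions issue: by default sshd ignores an authorized_keys file that other users can write to. Restricting the permissions is a safe extra step:

chmod 600 ~/.ssh/authorized_keys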

6th step: Check the ssh configuration

Now let’s check that ssh is properly configured; logging in should not ask for a password:

ssh localhost

7th step: Install Java version 8

sudo apt install openjdk-8-jdk

8th step: Verify the Java installation

To check if Java has been installed, just use this command:

java -version
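You should see output similar to the following (exact version numbers will differ):

openjdk version "1.8.0_292"
OpenJDK Runtime Environment (build 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10)
OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)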

9th step: Download Hadoop

To download Hadoop, we will use the following command (replace /home/arga/Desktop with the path to your own Desktop):

sudo wget -P /home/arga/Desktop https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

10th step: Unzip Hadoop

So, let’s go to the Desktop and unpack the archive:

cd Desktop
tar xzf hadoop-3.2.1.tar.gz

11th step: Rename the folder

We will rename the folder in order to make it easier to access:

mv hadoop-3.2.1 hadoop

12th step: Configure the hadoop-env.sh file

Let’s open that file using the following commands:

cd hadoop/etc/hadoop
nano hadoop-env.sh

And let’s add the following line to this file:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

And don’t forget to save the file!! (CTRL + O, ENTER, CTRL + X)
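If you are not sure of the exact JVM path on your machine, you can list the installed JVMs first and use the directory name you find there:

ls /usr/lib/jvm/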

13th step: Move the Hadoop folder

We will now move the Hadoop folder to /usr/local for easier use. Go back to the Desktop first, then move it:

cd ~/Desktop
sudo mv hadoop /usr/local/hadoop

To check that it has really been moved, list the destination:

ls /usr/local

As you can see, the Hadoop folder has been moved successfully.

14th step: Configure the environment file

So let’s move on to the environment file configuration:

sudo nano /etc/environment

And in the file, append the Hadoop bin and sbin directories to the PATH and add a JAVA_HOME line, so that it looks like this:

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"

Save the file!! (CTRL + O, ENTER, CTRL + X)

15th step: Create a Hadoop user

So, let’s move on to creating the Hadoop user

sudo adduser hadoopuser

16th step: Grant permissions to the Hadoop user

# add hadoopuser to its own group
sudo usermod -aG hadoopuser hadoopuser
# make hadoopuser the owner of the Hadoop installation
sudo chown hadoopuser:root -R /usr/local/hadoop/
# give the group read, write and execute on the whole tree
sudo chmod g+rwx -R /usr/local/hadoop/
# allow hadoopuser to use sudo
sudo adduser hadoopuser sudo

17th step: Check the IP Address

Let’s check the IP Address of the virtual machine. For this, we will use this command:

ip addr

As you can see, my machine’s IP is 192.168.56.103.

In order to do the multi-node configuration, we will use the following IP addresses:

  • master: 192.168.56.103
  • slave1: 192.168.56.104
  • slave2: 192.168.56.105

18th step: Configure the hosts file

sudo nano /etc/hosts
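Add the following lines, mapping each of the hostnames we will assign in step 20 to the IP addresses listed above:

192.168.56.103 master
192.168.56.104 slave1
192.168.56.105 slave2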

And save the file!! (CTRL + O, ENTER, CTRL + X)

19th step: Create slaves

After making the necessary settings in the previous step, we will shut down the master machine and clone it.

To do this step, just select the machine in VirtualBox and press CTRL + O (the Clone shortcut), then follow the steps shown in the following figures. In the clone wizard, it is a good idea to choose a full clone and to generate new MAC addresses for the network adapters, so that each clone gets its own IP address.

We will repeat the same process once more, since we want 2 secondary machines.

20th step: Configure the hostname file

We will now start the 3 machines and change the hostname of each one. For this, we will use the following command:

sudo nano /etc/hostname

And change the name of the master machine to master:

Then restart the machine so that the new hostname takes effect:

sudo reboot

Do exactly the same on the slave machines, naming them slave1 and slave2.

21st step: Configure SSH on the master

Switch to the Hadoop user:

su - hadoopuser

22nd step: Generate (again) an ssh key

ssh-keygen -t rsa

Next, we’ll copy that key to all the machines:

ssh-copy-id hadoopuser@master
ssh-copy-id hadoopuser@slave1
ssh-copy-id hadoopuser@slave2
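Before moving on, you can confirm that passwordless login works from the master to every node; each command should print the remote hostname without asking for a password:

ssh slave1 hostname
ssh slave2 hostname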

23rd step: Configure core-site.xml — master

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following code inside the <configuration> tags:

<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>

24th step: Configure hdfs-site.xml — master

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following code inside the <configuration> tags:

<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>

25th step: Configure workers file — master

sudo nano /usr/local/hadoop/etc/hadoop/workers

Add the hostnames of the slave nodes (if the file already contains a localhost line, remove it, otherwise the master will also start a DataNode):

slave1
slave2

26th step: Copy the master configuration files to the slaves

scp /usr/local/hadoop/etc/hadoop/* slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* slave2:/usr/local/hadoop/etc/hadoop/
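You can spot-check that the files arrived on a slave, for example:

ssh slave1 cat /usr/local/hadoop/etc/hadoop/workers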

27th step: Format the HDFS NameNode

Still on the master, as hadoopuser, reload the environment and format the NameNode. Near the end of the output you should see a message saying that the storage directory has been successfully formatted.

source /etc/environment
hdfs namenode -format

28th step: Start dfs

start-dfs.sh
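To confirm that everything came up, run jps (the JVM process listing tool that ships with the JDK) on each machine. On the master you should see NameNode and SecondaryNameNode; on each slave, a DataNode:

jps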

29th step: Configure yarn

Still as hadoopuser on the master, export the following variables (appending them to ~/.bashrc as well will make them persist across sessions):

export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME

30th step: Configure yarn-site.xml — slaves

On each slave machine, open the file:

sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

Add the following code inside the <configuration> tags:

<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>

31st step: Start yarn

start-yarn.sh
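A quick way to check that both slaves registered with the ResourceManager is to list the cluster’s nodes from the master:

yarn node -list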

The final output

Just open your browser and go to the following URL: http://master:8088/cluster

You should see the YARN web interface with the two slave machines listed as active nodes. In Hadoop 3.x, the HDFS NameNode also serves a web UI of its own at http://master:9870.
