Apache Hadoop — How to Implement a Multi-Node Distributed Platform

In this tutorial, we will implement a distributed platform using Apache Hadoop on Ubuntu. In this case, we will use one machine as the master and two as slaves.

For this, we will need the following:

  • Ubuntu 18.04 (it can run on a virtualized machine in VirtualBox);
  • Java 8 installed.

If you use VirtualBox, before starting the virtual machine, we will have to configure its network settings as shown in the following figure:
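
As an alternative to the graphical settings, assuming a Bridged Adapter is used so that the master and slave machines can reach each other on the local network, the same configuration can be applied from the host with VBoxManage (the VM name "hadoop-master" and the host interface "eth0" are placeholders for your own setup):

# Attach the VM's first network adapter to the host's physical network
# in bridged mode (run on the host, with the VM powered off).
VBoxManage modifyvm "hadoop-master" --nic1 bridged --bridgeadapter1 eth0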

Open the terminal and use the following commands:

sudo apt install ssh
sudo apt install pdsh

Next, we need to configure the .bashrc file. To open it, we will use the following command:

nano .bashrc

At the end of the file, we will set the PDSH_RCMD_TYPE variable to ssh by adding the following line:

export PDSH_RCMD_TYPE=ssh

Save the file using CTRL + O and then press ENTER. To exit, use CTRL + X.
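
To apply the change to the current session and confirm that the variable is set, we can reload the file and print it:

source ~/.bashrc
echo $PDSH_RCMD_TYPE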

Returning to the command line, we will generate an SSH key and add it to the authorized keys

ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now let’s check that SSH is properly configured

ssh localhost

Next, install Java 8:

sudo apt install openjdk-8-jdk

To check if java has been installed, just use this command:

java -version
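
We can also confirm the exact JDK directory, which we will need shortly for JAVA_HOME; on Ubuntu it is normally installed under /usr/lib/jvm:

ls /usr/lib/jvm/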

To download Hadoop, we will use the following command:

sudo wget -P /home/arga/Desktop https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
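
Optionally, before extracting, we can verify that the download is intact by computing its checksum and comparing it with the value published on the Apache downloads page:

sha512sum /home/arga/Desktop/hadoop-3.2.1.tar.gz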

Next, let’s go to the Desktop and extract the Hadoop archive:

cd Desktop
tar xzf hadoop-3.2.1.tar.gz

We will rename the folder in order to make it easier to access it

mv hadoop-3.2.1 hadoop

Now let’s open the hadoop-env.sh configuration file using the following commands:

cd hadoop
cd etc
cd hadoop
nano hadoop-env.sh

And let’s add the following line to this file:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

We will now move the Hadoop folder to /usr/local for easier use

cd ~/Desktop
sudo mv hadoop /usr/local/hadoop

To check that it has really been moved, let’s list the contents of /usr/local:

ls /usr/local

The hadoop folder should now appear in the listing, which means it has been moved successfully.

So let’s move on to environment file configuration

sudo nano /etc/environment

And in the file add the following configuration:

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"

So, let’s move on to creating the Hadoop user

sudo adduser hadoopuser
sudo usermod -aG hadoopuser hadoopuser
sudo chown hadoopuser:root -R /usr/local/hadoop/
sudo chmod g+rwx -R /usr/local/hadoop/
sudo adduser hadoopuser sudo
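
To confirm that the ownership and permissions were applied correctly, we can check:

ls -ld /usr/local/hadoop
groups hadoopuser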

Let’s check the IP Address of the virtual machine. For this, we will use this command:

ip addr

As you can see, my machine’s IP is .

In order to map the hostnames to the IP addresses of the machines, we will use the following IPs:

  • master:
  • slave1:
  • slave2:

Then, edit the hosts file and add one line per machine:

sudo nano /etc/hosts
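
As an example, the new entries in /etc/hosts would look like this (the IP addresses below are placeholders; use the ones reported by ip addr on your machines):

192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2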

After making the necessary settings in the previous step, we will shut down this machine, which will be the master, and clone it.

To do this step, just clone the machine in VirtualBox and follow the steps shown in the following figures:

We will repeat the cloning process, since we want two slave machines.

We will now start the three machines and change the hostname of each one. For this, we will use the following command on each machine:

sudo nano /etc/hostname

And change the machine’s name to master, slave1, or slave2, respectively:
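
For example, on the master machine the file should contain only this single line (use slave1 or slave2 on the other machines):

master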

Then, it is necessary to restart the machine so that this configuration takes effect.

sudo reboot

Do exactly the same for the slave1 and slave2 machines.

Next, on the master machine, switch to the hadoopuser account and generate an SSH key for it:

su - hadoopuser
ssh-keygen -t rsa

Next, we’ll copy that key to all machines:

ssh-copy-id hadoopuser@master
ssh-copy-id hadoopuser@slave1
ssh-copy-id hadoopuser@slave2
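
To confirm that passwordless SSH is working, we can run a command on one of the slaves; it should not ask for a password:

ssh hadoopuser@slave1 hostname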

Now, on the master, open the core-site.xml configuration file:

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following property between the <configuration> tags:

<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>

Next, open the hdfs-site.xml file:

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following properties between the <configuration> tags:

<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>

Now, open the workers file, which lists the machines that will run as worker nodes:

sudo nano /usr/local/hadoop/etc/hadoop/workers

Add the hostnames of the slave machines:

slave1
slave2

Now copy the Hadoop configuration files from the master to both slaves:

scp /usr/local/hadoop/etc/hadoop/* slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* slave2:/usr/local/hadoop/etc/hadoop/

Next, reload the environment, format the HDFS namenode, and start HDFS:

source /etc/environment
hdfs namenode -format
start-dfs.sh
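
To verify that HDFS started, we can list the running Java processes with jps; on the master you should see the NameNode and SecondaryNameNode, and on each slave a DataNode:

jps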

To configure YARN, first export the Hadoop environment variables:

export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
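
These exports only last for the current shell session; if you want them to survive a new login, the same lines can also be appended to hadoopuser's ~/.bashrc, for example:

echo 'export HADOOP_HOME="/usr/local/hadoop"' >> ~/.bashrc
# repeat for the other HADOOP_* variables above, then reload the file:
source ~/.bashrc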

Then, open the yarn-site.xml file:

sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

Add the following property between the <configuration> tags:

<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>

Finally, start YARN:

start-yarn.sh
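
Once YARN is up, we can check from the master that both slaves have registered as NodeManagers:

yarn node -list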

The final output

Just open your browser and enter the following URL to see the YARN ResourceManager web interface: master:8088/cluster
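
We can also confirm from the command line that both DataNodes joined the HDFS cluster:

hdfs dfsadmin -report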
