Apache Hadoop — How to Implement a Multi-Node Distributed Platform
In this tutorial, we will set up a distributed platform using Apache Hadoop on Ubuntu. One machine will act as the master (primary) and two will act as slaves (secondary).
For this, we will need the following:
- Ubuntu 18.04 (it can run on a virtual machine in VirtualBox);
- Java 8 installed.
If you use VirtualBox, before starting the virtual machine, configure its network settings as shown in the following figure:
1st step: Install ssh
Open the command line and use the following command:
sudo apt install ssh
2nd step: Install pdsh
sudo apt install pdsh
3rd step: Configure .bashrc
To access this file, we will use the following command:
nano .bashrc
At the end of the file, set pdsh to use ssh as its remote command type by adding the following line:
export PDSH_RCMD_TYPE=ssh
Save the file using CTRL + O and then press ENTER. To exit, use CTRL + X.
4th step: Generate an ssh key
Back on the command line, we will generate an ssh key:
ssh-keygen -t rsa -P ""
5th step: Copy the generated key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
6th step: Check the ssh configuration
Now let’s check that ssh is properly configured:
ssh localhost
7th step: Install Java version 8
sudo apt install openjdk-8-jdk
8th step: Verify the Java installation
To check that Java has been installed, just use this command:
java -version
9th step: Install Hadoop
To install Hadoop, first download the archive with the following command:
sudo wget -P /home/arga/Desktop https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
10th step: Unzip Hadoop
Now let’s go to the Desktop and extract the archive:
cd Desktop
tar xzf hadoop-3.2.1.tar.gz
11th step: Rename the folder
We will rename the folder to make it easier to access:
mv hadoop-3.2.1 hadoop
12th step: Configure the hadoop-env.sh file
Let’s open that file using the following commands:
cd hadoop
cd etc
cd hadoop
nano hadoop-env.sh
And let’s add the following line to this file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
And don’t forget to save the file!! (CTRL + O, ENTER, CTRL + X)
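If you are not sure that this is the correct JDK path on your machine, you can (optionally) confirm it by resolving the java symlink:
readlink -f /usr/bin/java
On Ubuntu 18.04 with OpenJDK 8, this should print a path inside /usr/lib/jvm/java-8-openjdk-amd64.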
13th step: Move the Hadoop folder
We will now move the Hadoop folder to /usr/local so that it is easier to use:
sudo mv hadoop /usr/local/hadoop
To check that it has really been moved, list the destination directory:
ls /usr/local
As you can see, the Hadoop folder has been moved successfully.
14th step: Configure the environment file
So let’s move on to environment file configuration
sudo nano /etc/environment
In the file, append the Hadoop bin and sbin directories to the PATH entry and add a JAVA_HOME entry, so that it contains the following:
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"
Save the file!! (CTRL + O, ENTER, CTRL + X)
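To confirm that the new variables are picked up, you can (optionally) reload the file in the current shell and print one of them:
source /etc/environment
echo $JAVA_HOME
This should print /usr/lib/jvm/java-8-openjdk-amd64/jre.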
15th step: Create a Hadoop user
So, let’s move on to creating the Hadoop user
sudo adduser hadoopuser
16th step: Give the Hadoop user the necessary permissions
Run the following commands:
sudo usermod -aG hadoopuser hadoopuser
sudo chown hadoopuser:root -R /usr/local/hadoop/
sudo chmod g+rwx -R /usr/local/hadoop/
sudo adduser hadoopuser sudo
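To confirm that the ownership of the Hadoop folder was changed, you can (optionally) list it:
ls -ld /usr/local/hadoop
The owner shown should now be hadoopuser.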
17th step: Check the IP Address
Let’s check the IP Address of the virtual machine. For this, we will use this command:
ip addr
As you can see, my machine’s IP is 192.168.56.103.
In order to do the multi-node configuration, we will use the following IP addresses:
- master: 192.168.56.103
- slave1: 192.168.56.104
- slave2: 192.168.56.105
18th step: Configure the hosts file
sudo nano /etc/hosts
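Using the IP addresses chosen in the previous step, the entries to add should look like this (adjust the IPs to match your own machines):
192.168.56.103 master
192.168.56.104 slave1
192.168.56.105 slave2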
And save the file!! (CTRL + O, ENTER, CTRL + X)
19th step: Create slaves
After making the settings from the previous step, we will shut down the machine (which will be the master) and clone it.
To do this step, just press CTRL + O in VirtualBox and follow the steps shown in the following figures:
We will repeat the same process a second time, since we want two slave machines.
20th step: Configure the hostname file
We will now start the three machines and change the hostname of each one. For this, we will use the following command:
sudo nano /etc/hostname
On the master machine, change the contents of the file to master.
Then restart the machine so that the change takes effect:
sudo reboot
Do exactly the same on the slave machines, setting their hostnames to slave1 and slave2.
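After each reboot, you can (optionally) confirm the new name by running:
hostname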
21st step: Configure SSH on the master
On the master, switch to the Hadoop user:
su - hadoopuser
22nd step: Generate (again) an ssh key
ssh-keygen -t rsa
Next, we will copy that key to all the machines (master and slaves):
ssh-copy-id hadoopuser@master
ssh-copy-id hadoopuser@slave1
ssh-copy-id hadoopuser@slave2
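To confirm that passwordless ssh is working, you can (optionally) log in to one of the slaves; it should not ask for a password:
ssh slave1
exit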
23rd step: Configure core-site.xml — master
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following property inside the <configuration> element:
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
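Note that core-site.xml (like the other Hadoop XML files edited below) already contains an empty <configuration> element, and the property goes inside it. The relevant part of the file should end up looking like this:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>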
24th step: Configure hdfs-site.xml — master
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following properties inside the <configuration> element:
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
25th step: Configure workers file — master
sudo nano /usr/local/hadoop/etc/hadoop/workers
Add the hostnames of the two slaves:
slave1
slave2
26th step: Copy the master configuration files to the slaves
scp /usr/local/hadoop/etc/hadoop/* slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* slave2:/usr/local/hadoop/etc/hadoop/
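To confirm that the files arrived, you can (optionally) check one of them on a slave:
ssh slave1 ls /usr/local/hadoop/etc/hadoop/core-site.xml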
27th step: Format the HDFS file system
source /etc/environment
hdfs namenode -format
28th step: Start dfs
start-dfs.sh
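To confirm that HDFS came up, you can (optionally) run the following on the master:
jps
hdfs dfsadmin -report
jps should list NameNode and SecondaryNameNode on the master (and DataNode on the slaves), and the report should show the two DataNodes once they have registered.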
29th step: Configure yarn
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
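These exports only last for the current shell session. If you want them to be permanent, you can (optionally) append the same lines to hadoopuser’s ~/.bashrc, for example:
cat >> ~/.bashrc << 'EOF'
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
EOF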
30th step: Configure yarn-site.xml — slaves
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
Add the following property inside the <configuration> element:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
31st step: Start yarn
start-yarn.sh
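To confirm that YARN started, you can (optionally) run the following on the master:
yarn node -list
It should list slave1 and slave2 once their NodeManagers have registered.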
The final output
Just open your browser and go to the following URL: http://master:8088/cluster. This is the YARN ResourceManager web interface, where you should see your cluster and its nodes.