Some Hadoop 2.X testing on a single-node Fedora Server 24 cluster

Now that we have a clean Hadoop installation, it is time for some testing and configuration.

Our first test is to check whether Hadoop is correctly installed and callable; we only need to run the command

$ hadoop

This will print the Hadoop usage message. To operate Hadoop without difficulties we need to disable IPv6

$ cd ~
$ cat > ~/disable-ipv6.conf << EOF
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
EOF
$ sudo mv ~/disable-ipv6.conf /etc/sysctl.d/
$ sudo systemctl restart NetworkManager.service
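
The files under /etc/sysctl.d/ are read at boot; to apply the new settings immediately, without rebooting, running the following should be enough

$ sudo sysctl --system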

To verify that IPv6 is disabled, the command

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

should print 1. We also need to disable firewalld in order to allow communication on all ports; to do this we run

$ sudo systemctl stop firewalld.service

Sadly, at this moment, I can't find a way to disable the firewalld service permanently, and

$ sudo systemctl disable firewalld.service

is not working: firewalld is still started at boot time.
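
A workaround worth trying (untested in this setup) is to mask the unit, which prevents systemd from starting it at all; it can later be restored with systemctl unmask

$ sudo systemctl mask firewalld.service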

Additionally, and for convenience, we make the hpc user the owner of all files and directories inside HADOOP_PREFIX

$ sudo chown -R hpc:hpc ${HADOOP_PREFIX}

After this, we are ready to test Hadoop.

The Standalone mode

For debugging purposes we can run Hadoop in standalone mode, using Hadoop directly without any configuration.

Hadoop Example

The first example is provided by Hadoop itself: we will read some files and locate the occurrences of words that match the regular expression 'dfs[a-z.]+'

$ mkdir -p ~/hadoop/input_1
$ cd ~/hadoop
$ cp ${HADOOP_PREFIX}/etc/hadoop/*.xml input_1
$ hadoop jar ${HADOOP_PREFIX}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input_1 output_1 'dfs[a-z.]+'
$ cat output_1/*

MapReduce Streaming Example

Hadoop has the ability to run code in languages other than Java through Hadoop Streaming. In this example we will analyze the sales of some stores with Python scripts. The necessary files are mapper.py, reducer.py and purchases.txt. To run this example, write the following commands in your terminal

$ mkdir -p ~/hadoop/input_2
$ cd ~/hadoop
$ wget -O mapper.py 'https://docs.google.com/uc?id=0B1UwrvDkN0YRTUlGMDBEV3ZpcW8&export=download'
$ wget -O reducer.py 'https://docs.google.com/uc?id=0B1UwrvDkN0YRUVg3eWQzMVhESk0&export=download'
$ chmod +x mapper.py reducer.py
$ wget -O input_2/purchases.txt https://doc-0g-ao-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/j2291p25o10j900kn02o1moikd9nrai0/1469138400000/02316338616399157944/*/0B1UwrvDkN0YRdUsxNUgzRkIxUzQ?e=download
$ hadoop jar ${HADOOP_PREFIX}/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input input_2 -output output_2 -mapper mapper.py -reducer reducer.py
$ cat output_2/*

This example will output the total amount of sales for every store according to the data in purchases.txt.
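
Before submitting the streaming job it can also be useful to test the scripts locally. Assuming mapper.py and reducer.py read from stdin and write tab-separated key/value pairs to stdout, as Hadoop Streaming expects, the following pipeline simulates the map, shuffle-sort and reduce phases outside Hadoop

$ cat input_2/purchases.txt | ./mapper.py | sort | ./reducer.py | head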

The Pseudo-distributed mode

After the standalone testing, the next step is to configure Hadoop to run in pseudo-distributed mode, treating our machine as a single-node cluster.

Phase 0-X: Preliminary steps

Before starting with the Hadoop configuration we need to do two tasks on our machine: 1) create our SSH key and 2) ensure that our machine can SSH into itself.

$ cd ~
$ mkdir -p ~/.ssh
$ ssh-keygen -t rsa -b 4096 -C "" -P "" -f ~/.ssh/mastervi
$ touch ~/.ssh/authorized_keys
$ cat ~/.ssh/mastervi.pub >> ~/.ssh/authorized_keys
$ ssh -i ~/.ssh/mastervi hpc@localhost

To ensure that we can ssh without the -i flag, we need to add the following lines to our terminal profile file (~/.bashrc, ~/.profile or ~/.zshrc)

export SSH_KEY_PATH="/home/hpc/.ssh/mastervi"

if [ ! -S ~/.ssh/ssh_auth_sock ]; then
  eval `ssh-agent -s`  > /dev/null 2>&1
  ln -sf "$SSH_AUTH_SOCK" ~/.ssh/ssh_auth_sock
fi

export SSH_AUTH_SOCK=~/.ssh/ssh_auth_sock
ssh-add ${SSH_KEY_PATH} > /dev/null 2>&1

After logging out and back in, we can ssh to localhost without the -i flag.
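
As a quick check that the passwordless login really works (start-dfs.sh relies on it to launch the daemons), the following should print the hostname without asking for a password

$ ssh hpc@localhost hostname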

Phase 1-A: Use of HDFS

The first thing we will do in order to run Hadoop in pseudo-distributed mode is to enable HDFS. For this we need to edit the following configuration files

  1. ${HADOOP_PREFIX}/etc/hadoop/hadoop-env.sh
    # export JAVA_HOME=${JAVA_HOME}
    export JAVA_HOME="$(dirname $(dirname $(readlink -f $(which javac))))"
    

    For some reason start-dfs.sh can't read JAVA_HOME when it is left defined as in the first, commented line above, so we set the path explicitly.

  2. ${HADOOP_PREFIX}/etc/hadoop/core-site.xml
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>
    
  3. ${HADOOP_PREFIX}/etc/hadoop/hdfs-site.xml
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
    

Now that we have configured HDFS, it is time to start it

$ hdfs namenode -format
$ start-dfs.sh

To ensure that everything is running correctly we can use any of the following commands

$ jps
$ hdfs dfsadmin -report

Both should show us one namenode and one datanode running. Now we need to prepare the HDFS system to load our data.

$ hdfs dfs -mkdir -p /user/hpc
$ rm -r output*

And that’s all, we are ready to run our examples.

Phase 1-B: Hadoop Example

This example can be run in the following way

$ hdfs dfs -put input_1 /user/hpc
$ hadoop jar ${HADOOP_PREFIX}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input_1 output_1 'dfs[a-z.]+'

To see the results we can run any of the following commands

$ hdfs dfs -get /user/hpc/output_1
$ cat output_1/*

or

$ hdfs dfs -cat /user/hpc/output_1/*

Phase 1-C: MapReduce Streaming Example

$ hdfs dfs -put input_2 /user/hpc
$ hadoop jar ${HADOOP_PREFIX}/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input input_2 -output output_2 -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py

In this case the -file flag causes the Python scripts to be shipped to the cluster machines (only one node in our case) as part of the job submission.
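
As a side note, recent Hadoop releases report -file as deprecated in favour of the generic -files option; assuming that option is available in your release, an equivalent invocation (generic options must come before the streaming-specific ones) would be

$ hadoop jar ${HADOOP_PREFIX}/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -files mapper.py,reducer.py -input input_2 -output output_2 -mapper mapper.py -reducer reducer.py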

To see the results we can run any of the following commands

$ hdfs dfs -get /user/hpc/output_2
$ cat output_2/*

or

$ hdfs dfs -cat /user/hpc/output_2/*

Before going on to phase 2 it is necessary to shut down the HDFS system with

$ stop-dfs.sh
$ rm -r output*
$ rm -rf /tmp/hadoop*  /tmp/hsperfdata* /tmp/Jetty_*

Phase 2-A: Use of YARN

Now we will use YARN (Yet Another Resource Negotiator), a cluster management technology released with Hadoop 2.X. In order to do this the following additional configurations are necessary:

  1. ${HADOOP_PREFIX}/etc/hadoop/mapred-site.xml
    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.address</name>
            <value>cm:10020</value>
        </property>
    </configuration>
    
  2. ${HADOOP_PREFIX}/etc/hadoop/yarn-site.xml
    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    </configuration>
    

Now that we have configured HDFS and YARN, it is time to start our system

$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh
$ mr-jobhistory-daemon.sh start historyserver

Check if everything is running well

$ jps
$ hdfs dfsadmin -report
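
Since YARN is now running as well, we can confirm that the NodeManager has registered with the ResourceManager; the following should list a single node in RUNNING state

$ yarn node -list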

And create the folders to put our data in HDFS.

$ hdfs dfs -mkdir -p /user/hpc

Phase 2-B: Hadoop Example

At this stage we have experience running our examples, so we just run them

$ hdfs dfs -put input_1 /user/hpc
$ hadoop jar ${HADOOP_PREFIX}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input_1 output_1 'dfs[a-z.]+'
$ hdfs dfs -get /user/hpc/output_1
$ cat output_1/*

Phase 2-C: MapReduce Streaming Example

In the same way, run these commands

$ hdfs dfs -put input_2 /user/hpc
$ hadoop jar ${HADOOP_PREFIX}/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input input_2 -output output_2 -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
$ hdfs dfs -get /user/hpc/output_2
$ cat output_2/*

Phase 2-X: Monitoring from the host system

When we run a task in Hadoop we get output similar to that shown in Figure 1.

Figure 1. Sample output of a Hadoop task

We can take the URL shown there, open it in the web browser of our host system, and watch the progress of the task.

Figure 2. Hadoop web monitoring
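
If the URL printed by the job uses a hostname that the host system cannot resolve, the web interfaces can also be reached directly by the IP address of the virtual machine; assuming the default Hadoop 2.x ports are unchanged, they are

http://<vm-ip>:8088/cluster   (YARN ResourceManager)
http://<vm-ip>:50070          (HDFS NameNode)
http://<vm-ip>:19888          (MapReduce JobHistory server)

where <vm-ip> stands for the address of the virtual machine.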

If the work went well, it is time to clean up our system

$ stop-dfs.sh
$ stop-yarn.sh
$ mr-jobhistory-daemon.sh stop historyserver
$ rm -r output*
$ rm -rf /tmp/hadoop*  /tmp/hsperfdata* /tmp/Jetty_*