Set up Hadoop cluster

Date: July 7th 2016
Last updated: July 7th 2016

The setup so far includes a DHCP server with one master node and three compute nodes. I followed this article: http://www.widriksson.com/raspberry-pi-hadoop-cluster/ and the book "Raspberry Pi Super Cluster" by Andrew K. Dennis.

At this point I have Raspbian and Java installed on all nodes (Java came with the Raspbian install, so there was nothing extra to do). The single access point to the cluster is rpi1 (the head node), and from it each compute node is reachable on the 192.168.50.* subnet by name or by address (e.g. from rpi1: $> ssh rpi2 or $> ssh 192.168.50.11).
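
For name resolution like "ssh rpi2" to work, each hostname has to map to its subnet address; one place for that is /etc/hosts on rpi1. A sketch (only rpi2's 192.168.50.11 is confirmed above; the other addresses are assumptions, so adjust to your actual DHCP reservations):

# /etc/hosts (sketch -- addresses other than rpi2's are assumed)
192.168.50.1    rpi1
192.168.50.11   rpi2
192.168.50.12   rpi3
192.168.50.13   rpi4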

ON EACH NODE:
Create hduser account

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo  # this is very important

Create an authorized key for password-less communication

su hduser
mkdir ~/.ssh
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
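
If ssh later still prompts for a password, check permissions first; sshd silently ignores keys when ~/.ssh or authorized_keys is writable by anyone but the owner:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys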

Distribute the key to each node
The source of information for this step was https://www.digitalocean.com/community/tutorials/how-to-configure-ssh-key-based-authentication-on-a-linux-server

cat ~/.ssh/id_rsa.pub | ssh username@remote_host "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

# my example (hduser@rpi1 distribute key to hduser@rpi2)
cat ~/.ssh/id_rsa.pub | ssh hduser@rpi2 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
# note: the -p (--parents) flag creates the directory if it doesn't exist,
# and exits quietly (no error, no action) if it already does.

# distribute to rpi3
# distribute to rpi4
# etc
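
Since rpi3 and rpi4 take the exact same command with only the host swapped, a loop sketch:

for h in rpi3 rpi4; do
  cat ~/.ssh/id_rsa.pub | ssh hduser@$h "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
done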

Make sure you can ssh into each node from rpi1

ssh rpi2
exit
ssh rpi3
exit
ssh rpi4 
exit
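
Or as a one-liner; each hostname should print without a password prompt:

for h in rpi2 rpi3 rpi4; do ssh $h hostname; done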

Install Hadoop
Go to the head node (e.g. rpi1) as user = pi/root
Download and install Hadoop

exit
pi@rpi1:~ $  # at this point you should be the user=pi/host=rpi1 (head node)
cd ~/
wget http://apache.mirrors.spacedump.net/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
# This might already exist after Raspbian install
sudo mkdir /opt
sudo tar -xvzf hadoop-1.2.1.tar.gz -C /opt/
cd /opt
# simplify the name
sudo mv hadoop-1.2.1 hadoop
# add privileges for hduser
sudo chown -R hduser:hadoop hadoop
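
Quick sanity check on the ownership change:

ls -ld /opt/hadoop   # owner/group should now read hduser hadoop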

Copy hadoop-1.2.1.tar.gz to the other nodes for install

sudo scp ~/hadoop-1.2.1.tar.gz pi@rpi2:~/.
sudo scp ~/hadoop-1.2.1.tar.gz pi@rpi3:~/.
sudo scp ~/hadoop-1.2.1.tar.gz pi@rpi4:~/.

edit /etc/bash.bashrc

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
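
To pick up the new variables in the current shell and confirm they resolve:

source /etc/bash.bashrc
echo $JAVA_HOME       # the JVM directory, with the trailing bin/java stripped
echo $HADOOP_INSTALL  # /opt/hadoop
which hadoop          # should point into /opt/hadoop/bin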

Return to the command line and check that hadoop is accessible for hduser

su hduser
hadoop version
exit

Go to each node individually (as user=pi/root) and repeat the same process (I'm not cloning the SD card, which would have been a lot easier to do); see the batch sketch after the block below.

ssh pi@rpi2 
# note: the pi user has no password-less key set up,
# so you need to enter the password.

# install hadoop 
# This might already exist after Raspbian install
sudo mkdir /opt
sudo tar -xvzf hadoop-1.2.1.tar.gz -C /opt/
cd /opt
# simplify the name
sudo mv hadoop-1.2.1 hadoop
# add privileges for hduser
sudo chown -R hduser:hadoop hadoop

# edit /etc/bash.bashrc
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin

# check it worked
su hduser
hadoop version
exit
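
Since rpi3 and rpi4 get the identical treatment, here is a sketch that batches the unpack/ownership steps from rpi1 (it assumes the tarball was already scp'd over as above, and that the pi user can sudo without a password, which is Raspbian's default; otherwise add -t to ssh). The /etc/bash.bashrc edit still needs doing by hand on each node:

for h in rpi3 rpi4; do
  ssh pi@$h 'sudo mkdir -p /opt &&
             sudo tar -xzf ~/hadoop-1.2.1.tar.gz -C /opt/ &&
             sudo mv /opt/hadoop-1.2.1 /opt/hadoop &&
             sudo chown -R hduser:hadoop /opt/hadoop'
done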

Configure Hadoop on each node
edit /opt/hadoop/conf/hadoop-env.sh

# The java implementation to use. Required.
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=250

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS -client"

edit /opt/hadoop/conf/core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hdfs/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://HOST:54310</value> <!-- NOTE: on every node replace HOST with the head node's hostname (rpi1) so all nodes point at the same NameNode -->
  </property>
</configuration>

edit /opt/hadoop/conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>HOST:54311</value> <!-- NOTE: on every node replace HOST with the head node's hostname (rpi1), where the JobTracker runs -->
  </property>
</configuration>
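
One way to fill in HOST on a node without opening an editor (a sketch; every node points at the head node rpi1):

sudo sed -i 's/HOST/rpi1/' /opt/hadoop/conf/core-site.xml /opt/hadoop/conf/mapred-site.xml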

edit /opt/hadoop/conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value> <!-- a single copy of each block; low overhead for the Pis' SD cards -->
  </property>
</configuration>

Create HDFS file system (run the format as hduser so the files under /hdfs/tmp keep the right owner; formatting also wipes any existing HDFS data, so it's a first-setup step only)

sudo mkdir -p /hdfs/tmp
sudo chown hduser:hadoop /hdfs/tmp
sudo chmod 750 /hdfs/tmp
su hduser
hadoop namenode -format
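
If the format succeeded, the NameNode's metadata directory shows up under hadoop.tmp.dir:

ls /hdfs/tmp/dfs/name   # created by the format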

edit /opt/hadoop/conf/masters (on the head node; in Hadoop 1.x this file lists where the SecondaryNameNode runs)

rpi1

edit /opt/hadoop/conf/slaves (on the head node; one worker per line. rpi1 is included, so the head node also acts as a DataNode/TaskTracker)

rpi1
rpi2
rpi3
rpi4

RUN

# start (as hduser on the head node, rpi1)
/opt/hadoop/bin/start-dfs.sh
/opt/hadoop/bin/start-mapred.sh
# stop
/opt/hadoop/bin/stop-dfs.sh
/opt/hadoop/bin/stop-mapred.sh

#OR

start-all.sh
stop-all.sh

# CHECK OPERATION ON EACH NODE (as hduser; jps only lists the current user's JVMs)
jps
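
On the head node all five Hadoop 1.x daemons should be listed (rpi1 appears in slaves, so it runs a DataNode and TaskTracker alongside the master daemons); compute nodes show only DataNode and TaskTracker. The PIDs below are illustrative:

hduser@rpi1:~ $ jps
2034 NameNode
2101 DataNode
2178 SecondaryNameNode
2256 JobTracker
2329 TaskTracker
2456 Jps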

Track in a web browser (use the head node's address on your network)

http://192.168.1.108:50030   # JobTracker web UI
http://192.168.1.108:50070   # NameNode (HDFS) web UI
