Set up Hadoop cluster
Date: July 7th 2016
Last updated: July 7th 2016
The setup so far consists of one master node (which also acts as DHCP server for the cluster subnet) and three compute nodes. I followed this article http://www.widriksson.com/raspberry-pi-hadoop-cluster/ and "Raspberry Pi Super Cluster" by Andrew K. Dennis.
At this point I have Raspbian and Java installed on all nodes (Java came preinstalled with Raspbian, so I didn't have to install it separately). I have one access point to the cluster via rpi1 (head node), and I can reach each compute node on the 192.168.50.* subnet (e.g. from rpi1: $> ssh rpi2 or $> ssh 192.168.50.11).
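For reference, one simple way to make the rpiN hostnames resolve is an /etc/hosts block on each node along these lines; only rpi2's address (192.168.50.11) appears above, so the other addresses here are placeholders - use whatever your DHCP server actually hands out.
192.168.50.1    rpi1    # head node (assumed address)
192.168.50.11   rpi2
192.168.50.12   rpi3    # assumed address
192.168.50.13   rpi4    # assumed address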
ON EACH NODE:
Create hduser account
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo # this is very important
Create authorized key for password-less communication
su hduser
mkdir ~/.ssh
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
Distribute the key to each node
Note that the source of information to do this step was from https://www.digitalocean.com/community/tutorials/how-to-configure-ssh-key-based-authentication-on-a-linux-server
cat ~/.ssh/id_rsa.pub | ssh username@remote_host "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
# my example (hduser@rpi1 distribute key to hduser@rpi2)
cat ~/.ssh/id_rsa.pub | ssh hduser@rpi2 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
# note the -p flag (--parents) creates the directory if it doesn't exist
# and silently does nothing if it already exists - no error either way.
# distribute to rpi3
# distribute to rpi4
# etc
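Since the same command repeats for every compute node, a short loop from rpi1 handles the remaining nodes (you will be prompted for hduser's password on each host this first time):
for host in rpi3 rpi4; do
  cat ~/.ssh/id_rsa.pub | ssh hduser@$host "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
done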
Make sure you can ssh into each node from rpi1
ssh rpi2
exit
ssh rpi3
exit
ssh rpi4
exit
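Or check all three in one pass - each line should print the remote hostname with no password prompt:
for host in rpi2 rpi3 rpi4; do ssh $host hostname; done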
Install Hadoop
Go to the head node (e.g. rpi1) as user = pi/root
Download and install Hadoop
exit
pi@rpi1:~ $ # at this point you should be the user=pi/host=rpi1 (head node)
cd ~/
wget http://apache.mirrors.spacedump.net/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
# This might already exist after Raspbian install
sudo mkdir -p /opt
sudo tar -xvzf hadoop-1.2.1.tar.gz -C /opt/
cd /opt
# simplify the name
sudo mv hadoop-1.2.1 hadoop
# add privileges for hduser
sudo chown -R hduser:hadoop hadoop
Copy hadoop-1.2.1.tar.gz to the other nodes for install
sudo scp ~/hadoop-1.2.1.tar.gz pi@rpi2:~/.
sudo scp ~/hadoop-1.2.1.tar.gz pi@rpi3:~/.
sudo scp ~/hadoop-1.2.1.tar.gz pi@rpi4:~/.
edit /etc/bash.bashrc (append the following lines)
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
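The JAVA_HOME line just resolves the java symlink and strips the trailing bin/java; to see what it evaluates to on a node (the exact path depends on the JDK your Raspbian image ships):
readlink -f /usr/bin/java | sed "s:bin/java::"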
Return to the command line and check Hadoop is accessible for hduser (su starts a new shell, so the updated /etc/bash.bashrc is picked up)
su hduser
hadoop version
exit
Go to each node individually (as user=pi/root) and repeat the same process (I'm not cloning the SD card, which would have been a lot easier to do).
ssh pi@rpi2
# note pi users are not passwordless,
# so you need to enter the password.
# install hadoop
# This might already exist after Raspbian install
sudo mkdir -p /opt
sudo tar -xvzf hadoop-1.2.1.tar.gz -C /opt/
cd /opt
# simplify the name
sudo mv hadoop-1.2.1 hadoop
# add privileges for hduser
sudo chown -R hduser:hadoop hadoop
# edit /etc/bash.bashrc
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
# check it worked
su hduser
hadoop version
exit
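Repeating that by hand on rpi3 and rpi4 works, but the extract/rename/chown part can also be driven from rpi1 in one loop - a sketch, assuming the tarball already sits in each pi home directory from the scp step and that pi has Raspbian's default passwordless sudo (the /etc/bash.bashrc edit still has to be made on each node):
for host in rpi3 rpi4; do
  ssh pi@$host "sudo mkdir -p /opt && sudo tar -xzf hadoop-1.2.1.tar.gz -C /opt/ && sudo mv /opt/hadoop-1.2.1 /opt/hadoop && sudo chown -R hduser:hadoop /opt/hadoop"
done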
Configure Hadoop on each node
edit /opt/hadoop/conf/hadoop-env.sh
# The java implementation to use. Required.
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=250
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS -client"
edit /opt/hadoop/conf/core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hdfs/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://HOST:54310</value> ### NOTE: replace HOST with the head node's hostname (rpi1 here); every node must point at the same NameNode
  </property>
</configuration>
edit /opt/hadoop/conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>HOST:54311</value> ### NOTE: replace HOST with the head node's hostname (rpi1 here); every node must point at the same JobTracker
  </property>
</configuration>
edit /opt/hadoop/conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
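As a concrete example, with rpi1 as the head node the two host-specific values end up identical on every node:
<value>hdfs://rpi1:54310</value>  <!-- fs.default.name in core-site.xml -->
<value>rpi1:54311</value>         <!-- mapred.job.tracker in mapred-site.xml -->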
Create HDFS file system
sudo mkdir -p /hdfs/tmp
sudo chown hduser:hadoop /hdfs/tmp
sudo chmod 750 /hdfs/tmp
hadoop namenode -format # run as hduser; the format really only matters on the head node (rpi1), where the NameNode runs
edit /opt/hadoop/conf/masters # lists the host(s) that run the SecondaryNameNode
rpi1
edit /opt/hadoop/conf/slaves # lists the hosts that run DataNode/TaskTracker; rpi1 is included so the head node also does work
rpi1
rpi2
rpi3
rpi4
RUN
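# run these on the head node (rpi1) as hduser; the scripts ssh into every host listed in slaves/masters and start or stop the remote daemons too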
# start
/opt/hadoop/bin/start-dfs.sh
/opt/hadoop/bin/start-mapred.sh
# stop
/opt/hadoop/bin/stop-dfs.sh
/opt/hadoop/bin/stop-mapred.sh
#OR
start-all.sh
stop-all.sh
# CHECK OPERATION ON EACH NODE
jps
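Roughly what jps should list once everything is up (process IDs omitted):
# rpi1 (head node, also listed in slaves): NameNode, SecondaryNameNode, JobTracker, DataNode, TaskTracker
# rpi2-rpi4: DataNode, TaskTracker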
Track in web browser
http://192.168.1.108:50030 (JobTracker web UI)
http://192.168.1.108:50070 (NameNode web UI)
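Another quick check from the head node as hduser - the datanode report should show all four nodes:
hadoop dfsadmin -report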