
Hadoop Cluster Configuration

Environment Configuration

User Configuration

  • Create the hadoop user: sudo useradd -m hadoop -s /bin/bash
  • Set a password for the hadoop user: sudo passwd hadoop
  • Add the hadoop user to the sudo group: sudo adduser hadoop sudo

You need to configure the hadoop user on all related servers
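
For convenience, the three commands above can be run together on each server; a minimal sketch (run it as a user that already has sudo rights):

Text Only
    sudo useradd -m hadoop -s /bin/bash   # create the hadoop user with a bash login shell
    sudo passwd hadoop                    # set its password interactively
    sudo adduser hadoop sudo              # add it to the sudo group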

SSH Configuration

The master needs to connect to all slaves via SSH without a password, so we need to copy the master's public key to all slaves.

  • Generate a key pair for the hadoop user on all servers with the following steps:

  • Create the .ssh directory for the hadoop user: mkdir -p ~/.ssh

  • Generate the key pair: ssh-keygen -t rsa -b 4096
  • Copy the master's public key to each slave: ssh-copy-id hadoop@slave

The master server can then SSH to any slave without being prompted for a password.
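
For example, the key generation and distribution from the master can be scripted as below; a minimal sketch assuming two hypothetical slave hostnames, slave1 and slave2 (substitute your own):

Text Only
    # run on the master as the hadoop user
    mkdir -p ~/.ssh
    ssh-keygen -t rsa -b 4096
    for host in slave1 slave2; do
        ssh-copy-id hadoop@$host      # appends the public key to ~/.ssh/authorized_keys on the slave
    done
    ssh hadoop@slave1 hostname        # should print the slave's hostname without asking for a password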

Java Installation

Install the Java JDK on all servers, using the same version and installation location on each, then configure JAVA_HOME and add Java to the system or user PATH.

You can set JAVA_HOME and the PATH in the user profile ~/.bashrc or the system profile /etc/profile, then source the changed profile to make it take effect.
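
A minimal sketch of the profile entries, assuming a hypothetical installation path of /usr/lib/jvm/java-8-openjdk-amd64 (adjust it to wherever your JDK actually lives):

Text Only
    # append to ~/.bashrc (or /etc/profile), then run: source ~/.bashrc
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # assumed path, change to your JDK location
    export PATH=$PATH:$JAVA_HOME/bin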

Hadoop Configuration

Download the Hadoop binary from the Apache site to the master server, and configure it with the following steps:

Make sure the hadoop user owns the decompressed Hadoop directory, and do the configuration below as the hadoop user
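
For example, unpacking the tarball and handing it over to the hadoop user might look like the sketch below; the version x.y.z and the /usr/local/hadoop location are placeholders, not requirements:

Text Only
    sudo tar -xzf hadoop-x.y.z.tar.gz -C /usr/local    # x.y.z is a placeholder version
    sudo mv /usr/local/hadoop-x.y.z /usr/local/hadoop  # this directory becomes $HADOOP_HOME
    sudo chown -R hadoop:hadoop /usr/local/hadoop      # give the hadoop user ownership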

Configure Hadoop

  • Set JAVA_HOME in the Hadoop environment file: in $HADOOP_HOME/etc/hadoop/hadoop-env.sh, change export JAVA_HOME=$JAVA_HOME to export JAVA_HOME=<your Java installation path>
  • Configure core-site.xml as follows:

    XML
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/hadoop/tmp</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://$MASTER_IP:9000</value>
      </property>
    </configuration>
    

    Make sure the directory set in hadoop.tmp.dir exists and is owned by the hadoop user (for example sudo mkdir -p /usr/hadoop/tmp && sudo chown hadoop:hadoop /usr/hadoop/tmp), and don't put it under /tmp, since /tmp may be cleared on reboot

  • Configure hdfs-site.xml. Make sure the dfs.replication value does not exceed the number of slaves (DataNodes) you have.

    XML
    <configuration>
      <property>
          <name>dfs.replication</name>
          <value>2</value>
      </property>
      <property>
          <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
          <value>false</value>
      </property>
    </configuration>
    

    Setting dfs.namenode.datanode.registration.ip-hostname-check to false disables the NameNode's hostname check when DataNodes register; leave this property out if you don't need that

  • Configure mapred-site.xml to set the job tracker address.

    XML
    <configuration>
      <property>
          <name>mapred.job.tracker</name>
          <value>$MASTER_IP:9001</value>
      </property>
    </configuration>
    

  • Configure slaves: edit the $HADOOP_HOME/etc/hadoop/slaves file and add each slave's IP address or hostname, one per line (example below). The master can also be added as a slave.

  • Set HADOOP_HOME, and add $HADOOP_HOME/bin and $HADOOP_HOME/sbin to PATH on the master server, as shown in the example after this list

  • Copy the configured Hadoop directory to all slaves: scp -r $HADOOP_HOME hadoop@slave:$HADOOP_HOME

Put $HADOOP_HOME at the same location on every slave as on the master, and run sudo chown -R hadoop:hadoop $HADOOP_HOME on each slave if needed
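
As an example, a two-slave slaves file and the matching profile entries on the master could look like the sketch below; slave1, slave2, and /usr/local/hadoop are assumptions, so use your own hostnames and install location:

Text Only
    # $HADOOP_HOME/etc/hadoop/slaves -- one slave hostname or IP per line
    slave1
    slave2

    # append to ~/.bashrc on the master, then run: source ~/.bashrc
    export HADOOP_HOME=/usr/local/hadoop                  # assumed install location
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin  # bin for hadoop/hdfs, sbin for the start/stop scripts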

Start & Stop Hadoop

  • Format the NameNode on the master as the hadoop user (one-time operation): hadoop namenode -format
  • Start Hadoop
    Text Only
       start-dfs.sh
       start-yarn.sh
    
    Open http://$MASTER_IP:50070 in a browser to view the Hadoop status
  • Stop Hadoop
    Text Only
       stop-yarn.sh
       stop-dfs.sh
    

    You can search Google for any issues you run into; the sketch below shows a quick way to check that the cluster is running
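
After starting Hadoop, you can verify that the daemons are running; a sketch using the standard jps and hdfs tools (slave1 is a hypothetical hostname):

Text Only
    jps                      # on the master: NameNode, SecondaryNameNode and ResourceManager should be listed
    ssh hadoop@slave1 jps    # on a slave: DataNode and NodeManager should be listed
    hdfs dfsadmin -report    # prints the live DataNodes and HDFS capacity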