Wednesday, April 29, 2015

6. Start Hadoop, Parse code by Musthafa + Kiran Kumar ...

The following are the steps to follow to use Hadoop.

1.Create a folder named user in the Hadoop file system (HDFS)
          hdfs dfs -mkdir /user

2.Create a folder whose name is the same as your username inside the user folder
          hdfs dfs -mkdir /user/<username>

3.Keep all the sample data files in a local folder (in the local file system)
           e.g  /home/<username>/input/input1.dat, /home/<username>/input/input2.dat,......

4.Copy the input folder into the Hadoop file system
         hdfs dfs -put /home/<username>/input
    Note: by default, files and folders are copied under hdfs://host:port/user/<username>/
    The command above therefore copies the input folder to hdfs:/user/<username>/

5.To check whether the input files were copied, use the following commands
    hdfs dfs -ls /user/<username>
    result : input
    hdfs dfs -ls /user/<username>/input
    result : input1.dat, input2.dat, .....

6.Write Java code to parse and reduce the data (this includes the business logic). The source code for our example is:
  
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ByteCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    static enum CountersEnum { INPUT_WORDS }

    // Compile the patterns once instead of on every map() call.
    private static final Pattern DST_PATTERN = Pattern.compile(" dst=([^ \t]*)");
    private static final Pattern RCVD_PATTERN = Pattern.compile(" rcvd=([^ \t]*)");

    private IntWritable bytecounter = new IntWritable(0);
    private Text word = new Text();

    private Configuration conf;

    @Override
    public void setup(Context context) throws IOException,
        InterruptedException {
      conf = context.getConfiguration();
    }

    @Override
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {

      String line = value.toString();
      Matcher ipMatcher = DST_PATTERN.matcher(line);
      Matcher bytesMatcher = RCVD_PATTERN.matcher(line);

      String ip = "";
      String bytes = "";

      if (ipMatcher.find() && bytesMatcher.find()) {
        ip = ipMatcher.group(1);
        bytes = bytesMatcher.group(1);
      }

      word.set(ip);
      int tmp = 0;
      try {
        if (!bytes.trim().isEmpty()) {
          tmp = Integer.parseInt(bytes);
        }
      } catch (NumberFormatException e) {
        // Treat a malformed byte count as zero rather than failing the task.
        tmp = 0;
      }
      bytecounter.set(tmp);
      context.write(word, bytecounter);
      Counter counter = context.getCounter(CountersEnum.class.getName(),
          CountersEnum.INPUT_WORDS.toString());
      counter.increment(tmp);
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
    String[] remainingArgs = optionParser.getRemainingArgs();
    if (remainingArgs.length != 2 && remainingArgs.length != 4) {
      System.err.println("Usage: bytecount <in> <out> [-skip skipPatternFile]");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "byte count");
    job.setJarByClass(ByteCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    List<String> otherArgs = new ArrayList<String>();
    for (int i=0; i < remainingArgs.length; ++i) {
      if ("-skip".equals(remainingArgs[i])) {
        job.addCacheFile(new Path(remainingArgs[++i]).toUri());
        job.getConfiguration().setBoolean("bytecount.skip.patterns", true);
      } else {
        otherArgs.add(remainingArgs[i]);
      }
    }
    FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
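The dst=/rcvd= extraction above can be sanity-checked locally before submitting a job. The sketch below applies the same two patterns to the first sample log line from the meeting notes; the class and method names here are ours, for illustration only, not part of the job code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone check of the mapper's regex extraction; no Hadoop cluster needed.
public class ParseCheck {

    private static final Pattern DST  = Pattern.compile(" dst=([^ \t]*)");
    private static final Pattern RCVD = Pattern.compile(" rcvd=([^ \t]*)");

    // Returns {destination IP, received bytes}, or null if either field is absent.
    static String[] extract(String line) {
        Matcher ip = DST.matcher(line);
        Matcher bytes = RCVD.matcher(line);
        if (ip.find() && bytes.find()) {
            return new String[] { ip.group(1), bytes.group(1) };
        }
        return null;
    }

    public static void main(String[] args) {
        String line = "[00001] 2015-04-10 11:21:39 [Root]system-notification-00257(traffic):"
            + " start_time=\"2015-04-10 11:21:39\" duration=0 policy_id=11 service=http proto=6"
            + " src zone=Trust dst zone=Untrust action=Permit sent=0 rcvd=0"
            + " src=192.168.16.236 dst=62.67.193.31 src_port=43855 dst_port=80"
            + " translated ip=111.93.148.154 port=1101";
        String[] r = extract(line);
        System.out.println("IP=" + r[0] + " BYTES=" + r[1]); // prints IP=62.67.193.31 BYTES=0
    }
}
```

Note that a line with no dst=/rcvd= fields returns null here, which mirrors the mapper emitting an empty IP key in that case.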


       
7.Compile sources.
    hadoop com.sun.tools.javac.Main ByteCount.java

8.Make a jar file with all the class files generated by the above command.
    jar cf wc.jar ByteCount*.class

9.Run the Hadoop job
    hadoop jar wc.jar ByteCount /user/<username>/input/ output
   
    Note: by default, the output folder is created under
hdfs:/user/<username>/
   
    so in this case the output folder is
hdfs:/user/<username>/output

10.Copy the output folder into local file system
    hdfs dfs -get output ./hadoop_output

11.Display the output
    cat ./hadoop_output/*

Result :

103.1.124.7    1234
103.1.124.8    314
103.1.124.9    3893
103.20.92.129    337565
103.229.206.84    134
103.243.222.155    5773
103.243.222.32    17127
103.243.222.41    5677
103.243.222.51    2330
103.243.222.81    4334
103.243.222.85    5703
103.243.222.93    1133
103.243.222.95    19137
103.245.222.134    12489
103.245.222.143    109683
103.245.222.166    1718
103.245.222.175    64680
103.245.222.249    53584
103.30.235.115    243
103.31.6.35    21930
104.155.232.138    11925
104.156.81.217    11026
104.156.85.134    8513
104.156.85.217    11289
104.16.12.13    20912
104.16.13.13    2101
....
....
....
....
....
....
....
....


Thanks and regards
Kirankumar  and Musthafa

5. Follow-on meeting on 15th Apr

Hi,
* Musthafa shared the architecture of Apache Hadoop with us. He pointed out this link - https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-on-ubuntu-13-10 - which has good installation instructions.

* Our next meeting will be on coming Monday @ 5.00 pm.


* Hadoop is installed on Musthafa's & Prateek's systems. [Hey.. Pl share your IP addresses, so that we can connect, if we want].

Our immediate plan:
* Pavithra will create a blog and share with others, so that we can stay in sync.
* Gowri to send 10 data files: (a) zipped files attached (b) another type of router log (critical) also attached.
* Read these files and convert them into objects in Hadoop. Kiran has volunteered to work on this area.
* Send a request to Hadoop to give a sum of bytes sent to a given list of IP addresses in this fashion. Musthafa & Prateek have volunteered to work on this and share their learning/code snippets.
           IP Address    :  Bytes
           202.54.32.4   :  32015
           192.145.224.55: 459068
           192.124.0.0   :  1246

4. Initial Installation Document


The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
This article walks you through installing and configuring a single-node Hadoop cluster, step by step.
Step 1. Install Java
Before installing Hadoop, make sure you have Java installed on your system. If you do not have Java installed, use the following article to install Java.
Step 2. Create User Account
Create a system user account to use for the Hadoop installation.
# useradd hadoop
# passwd hadoop
Changing password for user hadoop.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
Step 3. Configuring Key Based Login
The hadoop user must be able to ssh to itself without a password. The following commands enable key-based login for the hadoop user.
# su - hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
Step 4. Download and Extract Hadoop Source
Download the latest available Hadoop version from its official site, and follow the steps below.
# mkdir /opt/hadoop
# cd /opt/hadoop/
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
# tar -xzf hadoop-1.2.1.tar.gz
# mv hadoop-1.2.1 hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/
Step 5: Configure Hadoop
First, edit the Hadoop configuration files and make the following changes.
5.1 Edit core-site.xml
# vim conf/core-site.xml
#Add the following inside the configuration tag
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000/</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
5.2 Edit hdfs-site.xml
# vim conf/hdfs-site.xml
# Add the following inside the configuration tag
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/hadoop/dfs/name/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
5.3 Edit mapred-site.xml
# vim conf/mapred-site.xml
# Add the following inside the configuration tag
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
5.4 Edit hadoop-env.sh
# vim conf/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.7.0_17
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
Set the JAVA_HOME path as per your system's Java configuration.
Next, format the NameNode:
# su - hadoop
$ cd /opt/hadoop/hadoop
$ bin/hadoop namenode -format
13/06/02 22:53:48 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = srv1.tecadmin.net/192.168.1.90
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473; compiled by 'hortonfo' on Mon May 6 06:59:37 UTC 2013
STARTUP_MSG: java = 1.7.0_17
************************************************************/
13/06/02 22:53:48 INFO util.GSet: Computing capacity for map BlocksMap
13/06/02 22:53:48 INFO util.GSet: VM type = 32-bit
13/06/02 22:53:48 INFO util.GSet: 2.0% max memory = 1013645312
13/06/02 22:53:48 INFO util.GSet: capacity = 2^22 = 4194304 entries
13/06/02 22:53:48 INFO util.GSet: recommended=4194304, actual=4194304
13/06/02 22:53:49 INFO namenode.FSNamesystem: fsOwner=hadoop
13/06/02 22:53:49 INFO namenode.FSNamesystem: supergroup=supergroup
13/06/02 22:53:49 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/06/02 22:53:49 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/06/02 22:53:49 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/06/02 22:53:49 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
13/06/02 22:53:49 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/06/02 22:53:49 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/06/02 22:53:49 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/opt/hadoop/hadoop/dfs/name/current/edits
13/06/02 22:53:49 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/opt/hadoop/hadoop/dfs/name/current/edits
13/06/02 22:53:49 INFO common.Storage: Storage directory /opt/hadoop/hadoop/dfs/name has been successfully formatted.
13/06/02 22:53:49 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at srv1.tecadmin.net/192.168.1.90
************************************************************/
Step 6: Start Hadoop Services
Use the following command to start all hadoop services.
$ bin/start-all.sh
[sample output]
starting namenode, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-namenode-ns1.tecadmin.net.out
localhost: starting datanode, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-datanode-ns1.tecadmin.net.out
localhost: starting secondarynamenode, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-secondarynamenode-ns1.tecadmin.net.out
starting jobtracker, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-jobtracker-ns1.tecadmin.net.out
localhost: starting tasktracker, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-ns1.tecadmin.net.out

Step 7: Test and Access Hadoop Services
Use the 'jps' command to check whether all services started correctly.
$ jps
or
$ $JAVA_HOME/bin/jps
26049 SecondaryNameNode
25929 DataNode
26399 Jps
26129 JobTracker
26249 TaskTracker
25807 NameNode
Web Access URLs for Services
http://srv1.tecadmin.net:50030/ for the Jobtracker
http://srv1.tecadmin.net:50070/ for the Namenode
http://srv1.tecadmin.net:50060/ for the Tasktracker
Step 8: Stop Hadoop Services
If you no longer need Hadoop, stop all Hadoop services using the following command.
# bin/stop-all.sh


3. Defined Goal

Agenda for 5.00 pm meeting on 10th Apr, 2015:
a) Sharing a single line of log file
b) Expected output i.e. To make three graphs out of it
             1. Past records analysis
             2. Showing in-progress data (Dashboard)
             3. Future load projection
c) Immediate action plan
    * Install Hadoop in 2 systems (so that we can test scaling) - Need 2 volunteers
    * Port data - Need 1 volunteer  to learn and share what needs to be done
    * Analyze using map-reduce to get graphs (1) - Need 1 volunteer to identify how to do this
d) Next meet on coming Wednesday, 5.00 pm
e) A brief Q & A session

Sample log lines:
[00001] 2015-04-10 11:21:39 [Root]system-notification-00257(traffic): start_time="2015-04-10 11:21:39" duration=0 policy_id=11 service=http proto=6 src zone=Trust dst zone=Untrust action=Permit sent=0 rcvd=0 src=192.168.16.236 dst=62.67.193.31 src_port=43855 dst_port=80 translated ip=111.93.148.154 port=1101
[00002] 2015-04-10 11:21:39 [Root]system-notification-00257(traffic): start_time="2015-04-10 11:21:39" duration=0 policy_id=7 service=NETBIOS (NS) proto=17 src zone=Trust dst zone=VPN action=Permit sent=0 rcvd=0 src=192.168.16.87 dst=192.168.0.32 src_port=137 dst_port=137 translated ip=192.168.16.87 port=137 
:
: 
[00115] 2015-04-10 11:21:43 [Root]system-notification-00257(traffic): start_time="2015-04-10 11:21:43" duration=0 policy_id=11 service=dns proto=17 src zone=Trust dst zone=Untrust action=Permit sent=0 rcvd=0 src=192.168.16.12 dst=205.251.192.60 src_port=40901 dst_port=53 translated ip=111.93.148.154 port=1772 
:
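Every field in these lines follows the same key=value shape, so a small generic parser is a natural first step for the porting work. The sketch below is our own illustration, not project code; note that "src zone" and "dst zone" both land under the key zone, so the later value wins in this simple version:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull every key=value field out of one router-log line.
public class LogFields {

    // A field is word characters, '=', then either a quoted string or a bare token.
    private static final Pattern FIELD = Pattern.compile("(\\w+)=(\"[^\"]*\"|\\S+)");

    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // Strip the surrounding quotes from quoted values like start_time.
            fields.put(m.group(1), m.group(2).replace("\"", ""));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Trimmed version of the second sample line above.
        String line = "[00002] 2015-04-10 11:21:39 [Root]system-notification-00257(traffic):"
            + " start_time=\"2015-04-10 11:21:39\" duration=0 policy_id=7 proto=17"
            + " src zone=Trust dst zone=VPN action=Permit sent=0 rcvd=0"
            + " src=192.168.16.87 dst=192.168.0.32 src_port=137 dst_port=137";
        Map<String, String> f = parse(line);
        System.out.println(f.get("dst") + " rcvd=" + f.get("rcvd")); // prints 192.168.0.32 rcvd=0
    }
}
```

Multi-word values such as service=NETBIOS (NS) would need extra handling; this sketch only captures the first token after the '='.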
 
Quick summary of our meeting:
We briefly discussed the problem, i.e. analyzing router logs (mail files) and making sense out of them using Hadoop.

Our primary purpose is to learn to analyze huge amounts of data using Hadoop, with a specific deliverable, so that we have a problem to tackle and solve.

c) Immediate action plan
   * Install Hadoop in 2 systems (so that we can test scaling) : Need 2 volunteers ::
      - 10th Apr '15: We got Pavithra, Radhakrish & Prateek.
      - Kiran Patil will assist us in the installation, if there are any issues.
      - Kiran Patil also offered to create a VM, on which he will install the Hadoop.
      - Hadoop will be installed on local systems, provided the needed version of Java does not conflict with their projects' Java.

   * Port data - Need 1 volunteer to learn and share what needs to be done
      - Kiran Kumar, Sangamesh & Prateek volunteered.
      - What needs to be done? If there is data in a file (sample format shown below), how to import it into Hadoop environment?

   * Analyze using map-reduce to get graphs (1) - Need 1 volunteer to identify how to do this
      - Mustafa has volunteered to learn about map reduce and share it with the team.
      - ?: If there is data in hadoop, how to read it (map reduce it) for further processing?

Few questions discussed:
- Sangamesh mentioned that we might need at least 3 systems for Apache Hadoop to run. So, three volunteers were needed, and we got them.

? Why Hadoop?
! We needed an environment that allows us to import and analyze data. Hadoop was chosen in view of its non-restrictive license.
? How to handle duplicates, if any, in input data?
! Logically, there won't be any duplicates. If there are any, we need to handle them.. [How??? That was the question... Huh?? I couldn't hear you.. I guess your cell phone tower signal is weak].
? What is the deliverable for Wednesday? Is it the graph itself?
! No. Graph 1 is not the deliverable for Wednesday. We will meet on Wednesday to share our learnings and then take it forward on how to apply them.

What more?
Gowri & Giri will prepare a quick time line and share it on Wednesday.

2. Responses

Mustafa: Count me in.
Radhakrishna:  I would like to attend the meeting.
Pavithra:  Please count me in.
Kiran Kumar: Yes Gowri, I am interested.
Prateek: I am Interested Gowri..

1. EOI mail

Hi,
Interested in learning Hadoop?

Our Kiran Patil has a small problem in his sys admin domain that needs a solution. I'm planning to work on Hadoop in my free time to address it.

Expected completion: Mid May, 2015.
Kick off meeting: 5.00 pm today.

If you are interested, pl reply back to this mail before 3.00 pm today.

Thanks & Regards,
n gowrisankar.

PS: Kiran >> You are included already, so no need to reply to this mail :).