Wednesday, April 29, 2015

6. Start Hadoop, Parse code by Musthafa + Kiran Kumar ...

The following are the steps to follow to use Hadoop.

1.Create a folder named user in the Hadoop file system (HDFS)
          hdfs dfs -mkdir /user

2.Create a folder whose name is the same as your username inside the user folder
          hdfs dfs -mkdir /user/<username>

3.Keep all the sample data files in a local folder (in the local file system)
           e.g  /home/<username>/input/input1.dat, /home/<username>/input/input2.dat,......

4.Copy the input folder into the Hadoop file system
         hdfs dfs -put /home/<username>/input
    Note: by default, files and folders are copied under hdfs://host:port/user/<username>/
    The command above therefore copies the input folder to hdfs:/user/<username>/

5.To check whether the input files were copied, use the following commands
    hdfs dfs -ls /user/<username>
    result : input
    hdfs dfs -ls /user/<username>/input
    result : input1.dat, input2.dat, .....

6.Write Java code to parse and reduce the data (this includes the business logic). The source code for our example is:
  
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ByteCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    static enum CountersEnum { INPUT_WORDS }

    // Compile the patterns once instead of on every map() call.
    private static final Pattern DST_PATTERN = Pattern.compile(" dst=([^ \t]*)");
    private static final Pattern RCVD_PATTERN = Pattern.compile(" rcvd=([^ \t]*)");

    private IntWritable bytecounter = new IntWritable(0);
    private Text word = new Text();

    private Configuration conf;

    @Override
    public void setup(Context context) throws IOException,
        InterruptedException {
      conf = context.getConfiguration();
    }

    @Override
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {

      String line = value.toString();
      Matcher ipMatcher = DST_PATTERN.matcher(line);
      Matcher bytesMatcher = RCVD_PATTERN.matcher(line);

      String ip = "";
      String bytes = "";

      if (ipMatcher.find() && bytesMatcher.find()) {
        ip = ipMatcher.group(1);
        bytes = bytesMatcher.group(1);
      }

      word.set(ip);
      int tmp = 0;
      try {
        if (!bytes.trim().isEmpty()) {
          tmp = Integer.parseInt(bytes);
        }
      } catch (NumberFormatException e) {
        // Treat a malformed byte count as zero rather than failing the task.
        tmp = 0;
      }
      bytecounter.set(tmp);
      context.write(word, bytecounter);
      Counter counter = context.getCounter(CountersEnum.class.getName(),
          CountersEnum.INPUT_WORDS.toString());
      counter.increment(tmp);
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
    String[] remainingArgs = optionParser.getRemainingArgs();
    if (remainingArgs.length != 2 && remainingArgs.length != 4) {
      System.err.println("Usage: bytecount <in> <out> [-skip skipPatternFile]");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "byte count");
    job.setJarByClass(ByteCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    List<String> otherArgs = new ArrayList<String>();
    for (int i=0; i < remainingArgs.length; ++i) {
      if ("-skip".equals(remainingArgs[i])) {
        job.addCacheFile(new Path(remainingArgs[++i]).toUri());
        job.getConfiguration().setBoolean("bytecount.skip.patterns", true);
      } else {
        otherArgs.add(remainingArgs[i]);
      }
    }
    FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
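The dst=/rcvd= extraction above can be sanity-checked locally before submitting a job. The sketch below applies the same two patterns to the first sample log line from the meeting notes; the class and method names here are ours, for illustration only, not part of the job code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone check of the mapper's regex extraction; no Hadoop cluster needed.
public class ParseCheck {

    private static final Pattern DST  = Pattern.compile(" dst=([^ \t]*)");
    private static final Pattern RCVD = Pattern.compile(" rcvd=([^ \t]*)");

    // Returns {destination IP, received bytes}, or null if either field is absent.
    static String[] extract(String line) {
        Matcher ip = DST.matcher(line);
        Matcher bytes = RCVD.matcher(line);
        if (ip.find() && bytes.find()) {
            return new String[] { ip.group(1), bytes.group(1) };
        }
        return null;
    }

    public static void main(String[] args) {
        String line = "[00001] 2015-04-10 11:21:39 [Root]system-notification-00257(traffic):"
            + " start_time=\"2015-04-10 11:21:39\" duration=0 policy_id=11 service=http proto=6"
            + " src zone=Trust dst zone=Untrust action=Permit sent=0 rcvd=0"
            + " src=192.168.16.236 dst=62.67.193.31 src_port=43855 dst_port=80"
            + " translated ip=111.93.148.154 port=1101";
        String[] r = extract(line);
        System.out.println("IP=" + r[0] + " BYTES=" + r[1]); // prints IP=62.67.193.31 BYTES=0
    }
}
```

Note that a line with no dst=/rcvd= fields returns null here, which mirrors the mapper emitting an empty IP key in that case.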


       
7.Compile sources.
    hadoop com.sun.tools.javac.Main ByteCount.java

8.Make a jar file with all the class files generated by the above command.
    jar cf wc.jar ByteCount*.class

9.Run the Hadoop job
    hadoop jar wc.jar ByteCount /user/<username>/input/ output
   
    Note: by default, the output folder is created under
hdfs:/user/<username>/
   
    so in this case the output folder is
hdfs:/user/<username>/output

10.Copy the output folder into local file system
    hdfs dfs -get output ./hadoop_output

11.Display the output
    cat ./hadoop_output/*

Result :

103.1.124.7    1234
103.1.124.8    314
103.1.124.9    3893
103.20.92.129    337565
103.229.206.84    134
103.243.222.155    5773
103.243.222.32    17127
103.243.222.41    5677
103.243.222.51    2330
103.243.222.81    4334
103.243.222.85    5703
103.243.222.93    1133
103.243.222.95    19137
103.245.222.134    12489
103.245.222.143    109683
103.245.222.166    1718
103.245.222.175    64680
103.245.222.249    53584
103.30.235.115    243
103.31.6.35    21930
104.155.232.138    11925
104.156.81.217    11026
104.156.85.134    8513
104.156.85.217    11289
104.16.12.13    20912
104.16.13.13    2101
....
....
....
....
....
....
....
....


Thanks and regards
Kirankumar  and Musthafa

5. Follow-on meeting on 15th Apr

Hi,
* Musthafa shared the architecture of Apache Hadoop with us. He pointed out this link - https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-on-ubuntu-13-10 - which has good installation instructions.

* Our next meeting will be on coming Monday @ 5.00 pm.


* Hadoop is installed on Musthafa's & Prateek's systems. [Hey.. Pl share your IP addresses, so that we can connect, if we want].

Our immediate plan:
* Pavithra will create a blog and share with others, so that we can stay in sync.
* Gowri to send 10 data files: (a) zipped files attached (b) another type of router log (critical) also attached.
* Read these files and convert them into objects in Hadoop. Kiran has volunteered to work on this area.
* Send a request to Hadoop to give a sum of bytes sent to a given list of IP addresses in this fashion. Musthafa & Prateek have volunteered to work on this and share their learning/code snippets.
           IP Address    :  Bytes
           202.54.32.4   :  32015
           192.145.224.55: 459068
           192.124.0.0   :  1246

4. Initial Installation Document


The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
This article walks you through installing and configuring a single-node Hadoop cluster, step by step.
Step 1. Install Java
Before installing Hadoop, make sure you have Java installed on your system. If you do not have Java installed, use the following article to install Java.
Step 2. Create User Account
Create a system user account to use for the Hadoop installation.
# useradd hadoop
# passwd hadoop
Changing password for user hadoop.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
Step 3. Configuring Key Based Login
The hadoop user must be able to ssh to itself without a password. The following commands enable key-based login for the hadoop user.
# su - hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
Step 4. Download and Extract Hadoop Source
Download the latest available Hadoop version from its official site, and follow the steps below.
# mkdir /opt/hadoop
# cd /opt/hadoop/
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
# tar -xzf hadoop-1.2.1.tar.gz
# mv hadoop-1.2.1 hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/
Step 5: Configure Hadoop
First, edit the Hadoop configuration files and make the following changes.
5.1 Edit core-site.xml
# vim conf/core-site.xml
#Add the following inside the configuration tag
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000/</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
5.2 Edit hdfs-site.xml
# vim conf/hdfs-site.xml
# Add the following inside the configuration tag
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/hadoop/dfs/name/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
5.3 Edit mapred-site.xml
# vim conf/mapred-site.xml
# Add the following inside the configuration tag
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
5.4 Edit hadoop-env.sh
# vim conf/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.7.0_17
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
Set the JAVA_HOME path as per your system's Java configuration.
Next, format the NameNode:
# su - hadoop
$ cd /opt/hadoop/hadoop
$ bin/hadoop namenode -format
13/06/02 22:53:48 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = srv1.tecadmin.net/192.168.1.90
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473; compiled by 'hortonfo' on Mon May 6 06:59:37 UTC 2013
STARTUP_MSG: java = 1.7.0_17
************************************************************/
13/06/02 22:53:48 INFO util.GSet: Computing capacity for map BlocksMap
13/06/02 22:53:48 INFO util.GSet: VM type = 32-bit
13/06/02 22:53:48 INFO util.GSet: 2.0% max memory = 1013645312
13/06/02 22:53:48 INFO util.GSet: capacity = 2^22 = 4194304 entries
13/06/02 22:53:48 INFO util.GSet: recommended=4194304, actual=4194304
13/06/02 22:53:49 INFO namenode.FSNamesystem: fsOwner=hadoop
13/06/02 22:53:49 INFO namenode.FSNamesystem: supergroup=supergroup
13/06/02 22:53:49 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/06/02 22:53:49 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/06/02 22:53:49 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/06/02 22:53:49 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
13/06/02 22:53:49 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/06/02 22:53:49 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/06/02 22:53:49 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/opt/hadoop/hadoop/dfs/name/current/edits
13/06/02 22:53:49 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/opt/hadoop/hadoop/dfs/name/current/edits
13/06/02 22:53:49 INFO common.Storage: Storage directory /opt/hadoop/hadoop/dfs/name has been successfully formatted.
13/06/02 22:53:49 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at srv1.tecadmin.net/192.168.1.90
************************************************************/
Step 6: Start Hadoop Services
Use the following command to start all hadoop services.
$ bin/start-all.sh
[sample output]
starting namenode, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-namenode-ns1.tecadmin.net.out
localhost: starting datanode, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-datanode-ns1.tecadmin.net.out
localhost: starting secondarynamenode, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-secondarynamenode-ns1.tecadmin.net.out
starting jobtracker, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-jobtracker-ns1.tecadmin.net.out
localhost: starting tasktracker, logging to /opt/hadoop/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-ns1.tecadmin.net.out

Step 7: Test and Access Hadoop Services
Use the 'jps' command to check whether all services started correctly.
$ jps
or
$ $JAVA_HOME/bin/jps
26049 SecondaryNameNode
25929 DataNode
26399 Jps
26129 JobTracker
26249 TaskTracker
25807 NameNode
Web Access URLs for Services
http://srv1.tecadmin.net:50030/ for the Jobtracker
http://srv1.tecadmin.net:50070/ for the Namenode
http://srv1.tecadmin.net:50060/ for the Tasktracker
Step 8: Stop Hadoop Services
If you no longer need Hadoop, stop all Hadoop services using the following command.
# bin/stop-all.sh


3. Defined Goal

Agenda for 5.00 pm meeting on 10th Apr, 2015:
a) Sharing a single line of log file
b) Expected output i.e. To make three graphs out of it
             1. Past records analysis
             2. Showing in-progress data (Dashboard)
             3. Future load projection
c) Immediate action plan
    * Install Hadoop in 2 systems (so that we can test scaling) - Need 2 volunteers
    * Port data - Need 1 volunteer  to learn and share what needs to be done
    * Analyze using map-reduce to get graphs (1) - Need 1 volunteer to identify how to do this
d) Next meet on coming Wednesday, 5.00 pm
e) A brief Q & A session

Sample log lines:
[00001] 2015-04-10 11:21:39 [Root]system-notification-00257(traffic): start_time="2015-04-10 11:21:39" duration=0 policy_id=11 service=http proto=6 src zone=Trust dst zone=Untrust action=Permit sent=0 rcvd=0 src=192.168.16.236 dst=62.67.193.31 src_port=43855 dst_port=80 translated ip=111.93.148.154 port=1101
[00002] 2015-04-10 11:21:39 [Root]system-notification-00257(traffic): start_time="2015-04-10 11:21:39" duration=0 policy_id=7 service=NETBIOS (NS) proto=17 src zone=Trust dst zone=VPN action=Permit sent=0 rcvd=0 src=192.168.16.87 dst=192.168.0.32 src_port=137 dst_port=137 translated ip=192.168.16.87 port=137 
:
: 
[00115] 2015-04-10 11:21:43 [Root]system-notification-00257(traffic): start_time="2015-04-10 11:21:43" duration=0 policy_id=11 service=dns proto=17 src zone=Trust dst zone=Untrust action=Permit sent=0 rcvd=0 src=192.168.16.12 dst=205.251.192.60 src_port=40901 dst_port=53 translated ip=111.93.148.154 port=1772 
:
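Every field in these lines follows the same key=value shape, so a small generic parser is a natural first step for the porting work. The sketch below is our own illustration, not project code; note that "src zone" and "dst zone" both land under the key zone, so the later value wins in this simple version:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull every key=value field out of one router-log line.
public class LogFields {

    // A field is word characters, '=', then either a quoted string or a bare token.
    private static final Pattern FIELD = Pattern.compile("(\\w+)=(\"[^\"]*\"|\\S+)");

    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // Strip the surrounding quotes from quoted values like start_time.
            fields.put(m.group(1), m.group(2).replace("\"", ""));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Trimmed version of the second sample line above.
        String line = "[00002] 2015-04-10 11:21:39 [Root]system-notification-00257(traffic):"
            + " start_time=\"2015-04-10 11:21:39\" duration=0 policy_id=7 proto=17"
            + " src zone=Trust dst zone=VPN action=Permit sent=0 rcvd=0"
            + " src=192.168.16.87 dst=192.168.0.32 src_port=137 dst_port=137";
        Map<String, String> f = parse(line);
        System.out.println(f.get("dst") + " rcvd=" + f.get("rcvd")); // prints 192.168.0.32 rcvd=0
    }
}
```

Multi-word values such as service=NETBIOS (NS) would need extra handling; this sketch only captures the first token after the '='.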
 
Quick summary of our meeting:
We briefly discussed the problem, i.e. analyzing router logs (mail files) and making sense out of them using Hadoop.

Our primary purpose is to learn to analyze huge amounts of data using Hadoop, with a specific deliverable, so that we have a problem to tackle and solve.

c) Immediate action plan
   * Install Hadoop in 2 systems (so that we can test scaling) : Need 2 volunteers ::
      - 10th Apr '15: We got Pavithra, Radhakrish & Prateek.
      - Kiran Patil will assist us in the installation, if there are any issues.
      - Kiran Patil also offered to create a VM, on which he will install the Hadoop.
      - Hadoop will be installed on local systems, provided the needed version of Java does not conflict with their projects' Java.

   * Port data - Need 1 volunteer to learn and share what needs to be done
      - Kiran Kumar, Sangamesh & Prateek volunteered.
      - What needs to be done? If there is data in a file (sample format shown below), how to import it into Hadoop environment?

   * Analyze using map-reduce to get graphs (1) - Need 1 volunteer to identify how to do this
      - Mustafa has volunteered to learn about map reduce and share it with the team.
      - ?: If there is data in hadoop, how to read it (map reduce it) for further processing?

Few questions discussed:
- Sangamesh mentioned that we might need at least 3 systems for Apache Hadoop to run. So, three volunteers were needed, and we got them.

? Why Hadoop?
! We needed an environment that allows us to import and analyze data. Hadoop was chosen in view of its non-restrictive license.
? How to handle duplicates, if any, in input data?
! Logically, there won't be any duplicates. If there are any, we need to handle them.. [How??? That was the question... Huh?? I couldn't hear you.. I guess your cell phone tower signal is weak].
? What is the deliverable for Wednesday? Is it the graph itself?
! No. Graph 1 is not the deliverable for Wednesday. We will meet on Wednesday to share our learnings and then take it forward on how to apply them.

What more?
Gowri & Giri will prepare a quick time line and share it on Wednesday.

2. Responses

Mustafa: Count me in.
Radhakrishna:  I would like to attend the meeting.
Pavithra:  Please count me in.
Kiran Kumar: Yes Gowri, I am interested.
Prateek: I am Interested Gowri..

1. EOI mail

Hi,
Interested in learning Hadoop?

Our Kiran Patil has a small problem in his sys admin domain that needs a solution. I'm planning to work on Hadoop in my free time to address it.

Expected completion: Mid May, 2015.
Kick off meeting: 5.00 pm today.

If you are interested, pl reply back to this mail before 3.00 pm today.

Thanks & Regards,
n gowrisankar.

PS: Kiran >> You are included already, so no need to reply to this mail :).