WordCounter in Hadoop! (Windows PRACTICAL)

Hey! This is Shubham, and I am back with another tech write-up: HADOOP!

Ram and I were once thinking of exploring some new tech, so we decided to explore Hadoop!

Ram: Hey Shubham, what the heck is Hadoop? I see it in almost every job requirement nowadays!

Me: Sure Ram, I have watched some videos and built small projects on it. I'll tell you about Hadoop and also how I did it!

Ram: Also, when I tried to set it up I faced a lot of errors. Can you cover that too?

Me: Sure! There will also be an easy bonus for you (a GitHub repo).


Ram: Oh, nice! But has Hadoop actually benefited anyone so far? Very few people seem to study it.

Me: Cool, let me take the example of Walmart. In 2004, when Hurricane Frances was approaching Florida, analysts at Walmart mined their big data and found that before a hurricane people buy emergency supplies and, surprisingly, strawberry Pop-Tarts. So Walmart stocked its stores accordingly, and sales of those items increased about seven times.

Me: Cool, let's start then.

After reading about Hadoop and how it works, I think I can now define it in layman's terms: "Hadoop is an open-source software framework from Apache for storing and managing BIG DATA. It helps you analyze huge datasets and retrieve vital information from them, while easing scaling, management, and cost."

Ram: Hey, What’s Big Data then?


Me: Big data, as the name suggests, is a collection of very large datasets that can be structured, semi-structured, or unstructured.

As the online user base grows day by day, it generates enormous amounts of data. Examples include temperature readings recorded every second all day long, which parts of a page users tap, what makes a user close the app versus hit the Buy button, which posts a user likes, how many clicks a user typically makes on a website, and so on.

Me: Now, let's get back to Hadoop. It follows a master-slave architecture, and storage is handled by HDFS, Hadoop's distributed file system, which provides high-throughput access to application data. In short, HDFS is a module of Hadoop.

I feel storage in HDFS is a bit like a doubly-linked list: it stores data by dividing it into multiple blocks with the same maximum size (128 MB by default) and keeps each block at a total of 3 locations (DataNodes). There is a NameNode, a Secondary NameNode, and DataNodes.

If we write a block, the data is stored at one location and replicated to two more; this is how Hadoop becomes fault-tolerant.

class doubly_linked_list {
    int data;                  // payload of this block
    doubly_linked_list next;   // reference to the next block
    doubly_linked_list prev;   // reference to the previous block
}

i) The client requests the NameNode, which returns the addresses of free DataNodes.

ii) Assume a file of 200 MB split into two blocks: BLK A (128 MB) and BLK B (200 - 128 = 72 MB).

iii) After checking that the nodes are ready and free, it writes BLK A and BLK B, each at 3 locations. (A small sketch of this splitting and replication logic follows below.)
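
To make the idea concrete, here is a minimal, hypothetical Java sketch (my own illustration, not Hadoop's actual code) that splits a file into 128 MB blocks and assigns each block to 3 DataNodes, mirroring the write flow above. The DataNode names are made up for illustration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HdfsWriteSketch {

    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, the HDFS default block size
    static final int REPLICATION = 3;                   // each block is stored at 3 locations

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024;              // the 200 MB file from the example above
        List<String> dataNodes = Arrays.asList("dn1", "dn2", "dn3", "dn4"); // hypothetical DataNodes

        List<Long> blocks = splitIntoBlocks(fileSize);
        for (int i = 0; i < blocks.size(); i++) {
            // pick 3 DataNodes for this block (simple round-robin placement)
            List<String> targets = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                targets.add(dataNodes.get((i + r) % dataNodes.size()));
            }
            System.out.printf("BLK %c (%d MB) -> %s%n",
                    (char) ('A' + i), blocks.get(i) / (1024 * 1024), targets);
        }
    }

    // Split a file size into chunks of at most BLOCK_SIZE bytes.
    static List<Long> splitIntoBlocks(long fileSize) {
        List<Long> blocks = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 0) {
            long blk = Math.min(BLOCK_SIZE, remaining);
            blocks.add(blk);
            remaining -= blk;
        }
        return blocks;
    }
}

Running this prints BLK A (128 MB) and BLK B (72 MB), each mapped to 3 DataNodes, which is the same bookkeeping the NameNode does for a real write.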

The NameNode and the Secondary NameNode manage the filesystem metadata (despite its name, the Secondary NameNode is not a full backup of the NameNode):


i) The NameNode stores the HDFS filesystem metadata in a file called the FSImage.

ii) The Secondary NameNode periodically checkpoints the filesystem metadata held on the NameNode.


Me: Now let me tell you how Hadoop makes processing easy to handle: it has a component named MapReduce.

Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.


Let's take the example of a kitchen: there is only one chef and one box to take ingredients from. When orders are few, it's easy for that chef to keep up, but if orders increase, it becomes a problem.


Now let's assume we hire 5 more chefs. That will solve the issue, right? No, because there is still only one box, so the rest will wait in a queue while one chef takes ingredients. So what can we do?


i) We can think of it in map and reduce terms. We have a total of 6 chefs.

ii) 2 chefs will make meat, 2 will make sausage

iii) Now 2 head chefs will assemble everything; this solves our issue!

Now the chefs are happy! Cool.


Ram: Nice! But I'm tired now; can you show me how to build something?

Me: Sure, let's build a WordCounter. This is what I followed. Make sure you follow it precisely, because on Windows you will otherwise face a lot of errors!


In the WordCounter system, we have a word.txt file (maybe terabytes in size) and a JAR containing the job. We pass these to Hadoop, which runs them through the MapReduce steps and finally returns the word counts.

So, what it does is: first it splits the input into lines, then it maps each word present, then it shuffles to bring identical words together, then it reduces them, and then we get our final result.

Let's dive into the coding section


This is what our basic code looks like:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: tokenizes each input line and emits (word, 1) for every token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts emitted for each word.
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures the job and wires the mapper, combiner, and reducer together.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}        

So, here we basically tokenize the input data using StringTokenizer, then pass it through the mapper, which emits each word as a key-value pair with the default value 1. The pairs are then shuffled and sorted and handed to the reduce function, where we sum all the values for each key and write out the result. A small plain-Java sketch of this flow is shown below.
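
If you want to see what actually flows between the phases without running Hadoop, here is a rough plain-Java sketch (my own illustration, not part of Hadoop) that imitates the map, shuffle, and reduce steps on a couple of sample lines:

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountTrace {
    public static void main(String[] args) {
        String[] lines = { "mango beer beer", "beer beer" };   // tiny sample input

        // Map phase: emit (word, 1) for every token, like TokenizerMapper does.
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                pairs.add(new String[] { itr.nextToken(), "1" });
            }
        }
        System.out.println("After map: " + pairs.size() + " (word, 1) pairs");

        // Shuffle/sort phase: group the 1s by word (TreeMap keeps the keys sorted).
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] p : pairs) {
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add(Integer.parseInt(p[1]));
        }

        // Reduce phase: sum the grouped values, like IntSumReducer does.
        grouped.forEach((word, ones) ->
                System.out.println(word + "\t" + ones.stream().mapToInt(Integer::intValue).sum()));
    }
}

On these two sample lines it prints beer 4 and mango 1; the real job does exactly the same thing, just distributed across the cluster.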

Done!

Ram: Wow! That was easy. Can you tell me how to set it up? I faced a lot of errors.

Me: Sure Ram, here you go, let's get your hands dirty!


i) First, install JDK 1.8 at the location ‘D:\Java\jdk1.8.0_202’.

ii) Now install Git Bash on your system.

iii) Now download the Apache Hadoop hadoop.tar.gz archive and extract it to ‘D:\hadoop’ using the command tar -xvf hadoop.tar.gz.

iv) Now download the Apache Maven maven.tar.gz archive and extract it to ‘D:\apache-maven’.

v) Add these locations to the environment variables (Control Panel > System > Advanced system settings > Environment Variables), as sketched below.
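
For reference, this is roughly what I ended up with, assuming the install locations from the steps above (adjust the paths if yours differ):

JAVA_HOME   = D:\Java\jdk1.8.0_202
HADOOP_HOME = D:\hadoop
MAVEN_HOME  = D:\apache-maven

PATH += %JAVA_HOME%\bin;%HADOOP_HOME%\bin;%HADOOP_HOME%\sbin;%MAVEN_HOME%\bin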


vi) Install the Eclipse IDE for Java. Next you have to edit some Hadoop config files; this can be hectic and is a common source of errors, so I have already prepared everything for you. Simply clone the repo from GitHub and copy the bin, sbin, etc, and jars folders (all the files needed in Eclipse) into your installed Hadoop location [gptshubham595/hadoop-windows: Hadoop installation in Windows (github.com)].

vii) Create a folder named data inside the Hadoop directory, or just copy the folder there.

viii) Now run the command "hadoop namenode -format" in the terminal to format the NameNode.

ix) Now run "start-all.cmd" in the terminal to start Hadoop.

x) Now export WordCount.jar (compiled with JDK 1.8); see the code here [GitHub]. If you prefer the command line to Eclipse, a sketch of the build commands is shown below.
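
In case you want to build the JAR without Eclipse, here is a rough sketch following the standard Apache Hadoop tutorial; I exported mine from Eclipse, so treat these commands as an assumption and adjust them to your setup:

set HADOOP_CLASSPATH=%JAVA_HOME%\lib\tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf WordCount.jar WordCount*.class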



xi) Now create a word.txt file with some words:

mango beer beer
beer beer
mango mango chicken
chicken soup mango beer
fish soup mango beer

xii) Create a directory inside HDFS with "hdfs dfs -mkdir /inputdir", then put word.txt into that directory with "hdfs dfs -put word.txt /inputdir".

xiii) Now run the job using the JAR file in the terminal: "hadoop jar D:\WordCount.jar WordCount /inputdir/word.txt /outputdir".

xiv) Now simply check the output with "hdfs dfs -ls /outputdir", then view the contents of the file it created with "hdfs dfs -cat /outputdir/part-r-00000".

xv) This shows

beer 6
chicken 2
fish 1
mango 5
soup 2

Done! This is how simple it is to create and run a WordCounter program in Hadoop.

Thank You!

PS: I'm a newbie at this, so maybe I missed many things, or maybe the same thing can be explained in a much easier way. I tried my best to share my understanding with you all!
