WordCounter in Hadoop! (Windows PRACTICAL)

Hey! This is Shubham, and I am back with another tech write-up: HADOOP!

Ram and I were once thinking of exploring some new tech, so we decided to explore Hadoop!

Ram: Hey Shubham, what the heck is Hadoop? I see it in almost every job requirement nowadays!

Me: Sure Ram, I have watched some videos and built small projects on it. I'll tell you about Hadoop and also how I did it!

Ram: Also, when I tried to set it up I faced a lot of errors. Can you cover that too?

Me: Sure! There will also be an easy bonus for you (a GitHub repo).


Ram: Oh, nice! But has Hadoop actually benefited anyone so far? Very few people seem to study it.

Me: Cool, let me take the example of Walmart. In 2004, when Hurricane Frances was approaching Florida, analysts at Walmart mined their big data and found that before a hurricane people buy emergency supplies and, surprisingly, strawberry Pop-Tarts. So Walmart stocked its stores accordingly, and sales of those items increased about seven times.

Me: Cool, let's start then.

After reading about Hadoop and how it works, I think I can now define it in layman's terms: "Hadoop is an open-source software framework from Apache for storing and managing BIG DATA. It helps you analyze huge datasets and retrieve vital information from them, while easing scaling, management, and cost."

Ram: Hey, What’s Big Data then?


Me: Big data, as the name suggests, is a collection of very large datasets that can be structured, semi-structured, or unstructured.

As the online user base grows day by day, it generates enormous amounts of data. Examples include temperature readings recorded every second all day long, which parts of a page users tap, what makes a user close the app versus hit the Buy button, which posts a user likes, how many clicks a user typically makes on a website, and so on.

Me: Now, let's get back to Hadoop. It follows a master-slave architecture, and storage is handled by HDFS, Hadoop's distributed file system, which provides high-throughput access to application data. In short, HDFS is a module of Hadoop.

I feel storage in HDFS is a bit like a doubly-linked list: it stores data by dividing it into multiple blocks with the same maximum size (128 MB by default) and keeps each block at a total of 3 locations (DataNodes). There is a NameNode, a Secondary NameNode, and DataNodes.

If we write a block, the data is stored at one location and replicated to two more; this is how Hadoop becomes fault-tolerant.

class doubly_linked_list {
    int data;                  // payload of this block
    doubly_linked_list next;   // reference to the next block
    doubly_linked_list prev;   // reference to the previous block
}

i) The client requests the NameNode, which returns the addresses of free DataNodes.

ii) Assume a file of 200 MB split into two blocks: BLK A (128 MB) and BLK B (200 - 128 = 72 MB).

iii) After checking that the nodes are ready and free, it writes BLK A and BLK B, each at 3 locations. (A small sketch of this splitting and replication logic follows below.)
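
To make the idea concrete, here is a minimal, hypothetical Java sketch (my own illustration, not Hadoop's actual code) that splits a file into 128 MB blocks and assigns each block to 3 DataNodes, mirroring the write flow above. The DataNode names are made up for illustration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HdfsWriteSketch {

    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, the HDFS default block size
    static final int REPLICATION = 3;                   // each block is stored at 3 locations

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024;              // the 200 MB file from the example above
        List<String> dataNodes = Arrays.asList("dn1", "dn2", "dn3", "dn4"); // hypothetical DataNodes

        List<Long> blocks = splitIntoBlocks(fileSize);
        for (int i = 0; i < blocks.size(); i++) {
            // pick 3 DataNodes for this block (simple round-robin placement)
            List<String> targets = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                targets.add(dataNodes.get((i + r) % dataNodes.size()));
            }
            System.out.printf("BLK %c (%d MB) -> %s%n",
                    (char) ('A' + i), blocks.get(i) / (1024 * 1024), targets);
        }
    }

    // Split a file size into chunks of at most BLOCK_SIZE bytes.
    static List<Long> splitIntoBlocks(long fileSize) {
        List<Long> blocks = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 0) {
            long blk = Math.min(BLOCK_SIZE, remaining);
            blocks.add(blk);
            remaining -= blk;
        }
        return blocks;
    }
}

Running this prints BLK A (128 MB) and BLK B (72 MB), each mapped to 3 DataNodes, which is the same bookkeeping the NameNode does for a real write.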

The NameNode and the Secondary NameNode manage the filesystem metadata (despite its name, the Secondary NameNode is not a full backup of the NameNode):


i) The NameNode stores the HDFS filesystem metadata in a file called the FSImage.

ii) The Secondary NameNode periodically checkpoints the filesystem metadata held on the NameNode.


Me: Now let me tell you how Hadoop makes processing easy to handle: it has a component named MapReduce.

Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.


Let's take the example of a kitchen: there is only one chef and one box to take ingredients from. When orders are few, it's easy for that chef to keep up, but if orders increase, it becomes a problem.


Now let's assume we hire 5 more chefs. That will solve the issue, right? No, because there is still only one box, so the rest will wait in a queue while one chef takes ingredients. So what can we do?


i) We can think of it in map and reduce terms. We have a total of 6 chefs.

ii) 2 chefs will make meat, 2 will make sausage

iii) Now 2 head chefs will assemble everything; this solves our issue!

Now the chefs are happy! Cool.


Ram: Nice! But I'm tired now; can you show me how to build something?

Me: Sure, let's build a WordCounter. This is what I followed. Make sure you follow it precisely, because on Windows you will otherwise face a lot of errors!


In the WordCounter system, we have a word.txt file (maybe terabytes in size) and a JAR containing the job. We pass these to Hadoop, which runs them through the MapReduce steps and finally returns the word counts.

So, what it does is: first it splits the input into lines, then it maps each word present, then it shuffles to bring identical words together, then it reduces them, and then we get our final result.

Let's dive into the coding section


This is what our basic code looks like:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: tokenizes each input line and emits (word, 1) for every token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts emitted for each word.
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures the job and wires the mapper, combiner, and reducer together.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}        

So, here we basically tokenize the input data using StringTokenizer, then pass it through the mapper, which emits each word as a key-value pair with the default value 1. The pairs are then shuffled and sorted and handed to the reduce function, where we sum all the values for each key and write out the result. A small plain-Java sketch of this flow is shown below.
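
If you want to see what actually flows between the phases without running Hadoop, here is a rough plain-Java sketch (my own illustration, not part of Hadoop) that imitates the map, shuffle, and reduce steps on a couple of sample lines:

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountTrace {
    public static void main(String[] args) {
        String[] lines = { "mango beer beer", "beer beer" };   // tiny sample input

        // Map phase: emit (word, 1) for every token, like TokenizerMapper does.
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                pairs.add(new String[] { itr.nextToken(), "1" });
            }
        }
        System.out.println("After map: " + pairs.size() + " (word, 1) pairs");

        // Shuffle/sort phase: group the 1s by word (TreeMap keeps the keys sorted).
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] p : pairs) {
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add(Integer.parseInt(p[1]));
        }

        // Reduce phase: sum the grouped values, like IntSumReducer does.
        grouped.forEach((word, ones) ->
                System.out.println(word + "\t" + ones.stream().mapToInt(Integer::intValue).sum()));
    }
}

On these two sample lines it prints beer 4 and mango 1; the real job does exactly the same thing, just distributed across the cluster.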

Done!

Ram: Wow! That was easy. Can you tell me how to set it up? I faced a lot of errors.

Me: Sure Ram, here you go, let's get your hands dirty!


i) First, install JDK 1.8 at the location ‘D:\Java\jdk1.8.0_202’.

ii) Now install Git Bash on your system.

iii) Now download the Apache Hadoop hadoop.tar.gz archive and extract it to ‘D:\hadoop’ using the command tar -xvf hadoop.tar.gz.

iv) Now download the Apache Maven maven.tar.gz archive and extract it to ‘D:\apache-maven’.

v) Add these locations to the environment variables (Control Panel > System > Advanced system settings > Environment Variables), as sketched below.
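
For reference, this is roughly what I ended up with, assuming the install locations from the steps above (adjust the paths if yours differ):

JAVA_HOME   = D:\Java\jdk1.8.0_202
HADOOP_HOME = D:\hadoop
MAVEN_HOME  = D:\apache-maven

PATH += %JAVA_HOME%\bin;%HADOOP_HOME%\bin;%HADOOP_HOME%\sbin;%MAVEN_HOME%\bin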


vi) Install the Eclipse IDE for Java. Next you have to edit some Hadoop config files; this can be hectic and is a common source of errors, so I have already prepared everything for you. Simply clone the repo from GitHub and copy the bin, sbin, etc, and jars folders (all the files needed in Eclipse) into your installed Hadoop location [gptshubham595/hadoop-windows: Hadoop installation in Windows (github.com)].

vii) Create a folder named data inside the Hadoop directory, or just copy the folder there.

viii) Now run the command "hadoop namenode -format" in the terminal to format the NameNode.

ix) Now run "start-all.cmd" in the terminal to start Hadoop.

x) Now export WordCount.jar (compiled with JDK 1.8); see the code here [GitHub]. If you prefer the command line to Eclipse, a sketch of the build commands is shown below.
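
In case you want to build the JAR without Eclipse, here is a rough sketch following the standard Apache Hadoop tutorial; I exported mine from Eclipse, so treat these commands as an assumption and adjust them to your setup:

set HADOOP_CLASSPATH=%JAVA_HOME%\lib\tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf WordCount.jar WordCount*.class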



xi) Now create a word.txt file with some words:

mango beer beer
beer beer
mango mango chicken
chicken soup mango beer
fish soup mango beer

xii) Create a directory inside HDFS with "hdfs dfs -mkdir /inputdir", then put word.txt into that directory with "hdfs dfs -put word.txt /inputdir".

xiii) Now run the job using the JAR file in the terminal: "hadoop jar D:\WordCount.jar WordCount /inputdir/word.txt /outputdir".

xiv) Now simply check the output with "hdfs dfs -ls /outputdir", then view the contents of the file it created with "hdfs dfs -cat /outputdir/part-r-00000".

xv) This shows

beer 6
chicken 2
fish 1
mango 5
soup 2

Done! This is how simple it is to create and run a WordCounter program in Hadoop.

Thank You!

PS: I'm a newbie at this, so maybe I missed many things, or maybe the same thing can be explained in a much easier way. I tried my best to share my understanding with you all!
