Hadoop - Managers' snapshot

This is my first blog of 2023. The goal is to create a “snapshot for managers” view of Apache Spark and Hadoop. This first iteration attempts to cover the Hadoop balcony view.

Before I proceed, a disclaimer: the article relies heavily on internet research and borrows pictures, diagrams, and text available on the internet, all for a non-commercial purpose – pure community learning. Wherever possible, I have tried to credit the source; if credit is missing for any section, picture, diagram, or text, it is simply an oversight. The objective is to provide a useful summary for managers and business leaders who don’t have time to get their hands dirty with details and yet want a grasp of the subject matter that is more than a cursory understanding.

With that, let’s dive.


Source: Internet


First, let me begin with the “why” – why was Hadoop needed? Understanding the “why” is a powerful way to appreciate a technology and build an intuitive sense of it.

The “Why”:

The first “why” – the explosion of data: There has been a whole gamut of technology shifts – digitization, new business models, and changes in customer behavior powered by technology and societal change. eCommerce, cheaper sensor technologies (IoT), cameras, smartphones, and social media are a few examples of the sources driving this data explosion.

The second “why” – a marketplace and consumer for this explosive data growth: Analytics and data science are the new kings, and ML is the key engine of analytics. The ML half of the AI/ML duo thrives on data: with neural networks and deep learning techniques, the more data, the better the learning.


Source: Internet


So, what do we have as a result of all of this?

  1. Volume - Large amount of data (from terabytes to petabytes) to be processed
  2. Variety – Structured, unstructured – text, images and everything in between
  3. Velocity – The speed at which data is being generated. Every day, 900 million photos are uploaded to Facebook, 500 million tweets are posted on Twitter, 0.4 million hours of video are uploaded to YouTube, and 3.5 billion searches are performed on Google


This is the genesis of “Big Data”. Essentially, people (it started with the Googles of the world, who had to deal with huge amounts of data in the immediate term) had lots of data, and it needed to be analyzed at an affordable cost in compute and time.

And here came the third “why”:

The third “why” – cloud computing: the elastic power of the cloud (you can spin compute resources up or down at will, at the drop of a hat) and the commoditization and democratization of compute power (cloud resources are cheap and affordable even at the single-unit level), at scale and at everyone’s disposal – all powered by advances in virtualization, containerization, and high-speed, affordable connectivity.

Now that we have built (hopefully) some sense of the “why”, let’s shine a light on the “how”.

The hard problem at hand: We have the ingredients, but we need to cook them to feed our hunger.

Deal with a gigantic amount of data – store it, analyze it, and find actionable insights. Do it at scale, at an affordable cost, and at speed, to be effective.

Solving the storage problem first: Introducing HDFS

Source: Internet


Here is a basic “Big Data” problem, oversimplified and used as a hypothetical situation to build the first foundational intuition: suppose you have 256 GB of data to process, in a world where a desktop computer can store only 64 GB on its hard drive. How do you deal with storing the data, even before you attack the analysis part?

You could solve it by slicing the data into eight 32 GB pieces and storing them across a cluster of eight desktop machines, each with 64 GB of storage capacity (you leave another 32 GB free on each machine for other usage rather than grabbing everything you have). Basically, you divide and conquer by distributing the file throughout the cluster of desktop machines. This is the first seed of the cluster computing idea that comes up over and over in “Big Data” processing frameworks: create a cluster of ordinary, inexpensive computers, exploit the power of distributed storage and distributed (and therefore parallel) processing, and build in redundancy to solve the “Big Data” problem in an affordable way.

Source: Internet

In the world of Hadoop, they designed a file system called HDFS (Hadoop Distributed File System) that uses the combined storage power of all the nodes of the cluster of inexpensive computers. HDFS is a user-space file system and relies on the underlying native file system for actual storage.

Addressing the storage problem by dividing and conquering:

Source: Internet


Suppose we divide the 256 GB file into 4 KB blocks and save them as file splits (file1, file2, and so on), since 4 KB is typically the smallest block size of the underlying file system. Reading the whole file would then mean seeking the start of 256 GB / 4 KB = 67,108,864 splits – 67,108,864 processes seeking a file starting point, which adds up to a significant amount of time spent just on seeks.

Also, keep in mind that you would have to do bookkeeping for every block you store (say block1 = file1 and file1 is stored on datanode1, block2 = file2 and file2 is stored on datanode2, and so on). The metadata would be huge – enough to keep records of 67,108,864 files.

Now, if we make the block size too big, say 32 GB, you minimize the total number of seeks (256 GB / 32 GB = 8 splits and therefore 8 seeks) and the metadata stays very small – just enough to keep track of 8 files. However, you would be limited to 8 parallel processes reading the files, so you lose the power of parallelism.


HDFS deals with this by balancing the block size at a level where you get the benefit of parallelism, a metadata burden that is neither too heavy nor too light, and a reasonable number of seeks. Typical HDFS block sizes are 64 MB or 128 MB. In our case, a 128 MB block size would mean 2048 blocks distributed over 8 nodes, with 256 blocks stored on each machine. The total number of seeks would be 2048, the primary (master) node would hold metadata for 2048 blocks, and each machine could run up to 256 parallel processes.
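To make the trade-off concrete, here is a back-of-the-envelope sketch in Python (the 256 GB file, the 8 nodes, and the candidate block sizes are just the illustrative numbers from this example, not anything read from a real cluster):

# Back-of-the-envelope view of the block-size trade-off described above.
FILE_SIZE = 256 * 1024**3                  # 256 GB file
NODES = 8                                  # machines in the cluster

def block_stats(block_size):
    blocks = FILE_SIZE // block_size       # number of splits = seeks = metadata entries
    blocks_per_node = blocks // NODES      # parallel work available per machine
    return blocks, blocks_per_node

for label, size in [("4 KB", 4 * 1024), ("128 MB", 128 * 1024**2), ("32 GB", 32 * 1024**3)]:
    blocks, per_node = block_stats(size)
    print(f"{label:>7}: {blocks:>12,} blocks, {per_node:>10,} per node")

# Output:
#    4 KB:   67,108,864 blocks,  8,388,608 per node  -> huge metadata and seek overhead
#  128 MB:        2,048 blocks,        256 per node  -> the balance HDFS aims for
#   32 GB:            8 blocks,          1 per node  -> tiny metadata, almost no parallelism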

Addressing the non-reliability of inexpensive HW by leveraging the power of redundancy:

Source: Internet


Remember, we have been talking about using inexpensive hardware, and that comes at a price – reliability. What if one of the data nodes fails? HDFS deals with that problem by maintaining two additional, redundant copies of each split (three copies in total, by default). So, for the scenario described above, we would store 2048 × 3 = 6144 blocks across the cluster. The master (primary) node keeps metadata so that it knows where copy 1 and copy 2 of any given split live.

Associated with this is the concept of the heartbeat: each datanode regularly sends a heartbeat to the primary node, so that the master knows when the read/write/processing of a split needs to be moved to a different node.
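A toy sketch of the heartbeat idea (purely illustrative – none of these class or method names come from the Hadoop codebase, and the timeout value is made up): DataNodes check in periodically, and the NameNode treats any node that stops checking in as dead, so its blocks can be re-replicated elsewhere.

import time

HEARTBEAT_TIMEOUT = 30                   # seconds; an illustrative value, not the real HDFS default

class ToyNameNode:
    """Toy bookkeeping of DataNode liveness; not the real Hadoop implementation."""
    def __init__(self):
        self.last_seen = {}              # datanode id -> timestamp of the last heartbeat

    def receive_heartbeat(self, datanode_id):
        self.last_seen[datanode_id] = time.time()

    def dead_nodes(self):
        now = time.time()
        # Blocks held by these nodes would need to be re-replicated on healthy nodes.
        return [node for node, ts in self.last_seen.items() if now - ts > HEARTBEAT_TIMEOUT]

namenode = ToyNameNode()
namenode.receive_heartbeat("datanode-1")
namenode.receive_heartbeat("datanode-2")
print(namenode.dead_nodes())             # [] while every node keeps checking in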

Source: Internet


Overall, as you can see, a Hadoop cluster has many data nodes (machines) and a master node (machine). The master node, by definition, needs to be more expensive and reliable. Each data node (also known as a slave node) runs a background process named DataNode. This background process (also known as a daemon) keeps track of the slices of data that the system stores on its computer. It regularly talks to the master server for HDFS (known as the NameNode) to report on the health and status of the locally stored data. Data blocks are stored as raw files in the local file system. From the perspective of a Hadoop user, you have no idea which of the slave nodes holds the pieces of the file you need to process. From within Hadoop, you don’t see data blocks or how they’re distributed across the cluster – all you see is a listing of files in HDFS. The complexity of how the file blocks are distributed across the cluster is hidden from you, and you don’t need to know it. In fact, the slave nodes themselves don’t even know what’s inside the data blocks they’re storing. It’s the NameNode server that knows the mappings of which data blocks compose the files stored in HDFS.


Before we move on to the next aspect of Hadoop, let’s spend a little time on how HDFS handles fault tolerance and speed by being deliberate about where it places file blocks and their replicas.

Source: Internet

We will begin by explaining a term called “rack”.

Hadoop and HDFS, you may recall, form a cluster computing framework. By extension, an HDFS cluster is a collection of racks. A “rack” is a collection of DataNodes that are physically close to each other and connected to the same network switch. Network bandwidth between any two DataNodes in the same rack is greater than between two DataNodes in different racks.


Remember we mentioned replicas and fault tolerance? To achieve them, HDFS needs to be aware of the network topology and intentional about where it places file splits and their replicas. For example, in the picture above, if one of the switches goes down, all datanodes under that switch become unavailable. If that rack held a given file split and both of its replicas (remember, by default we keep three copies of each split in total), we would be in trouble. So, for fault tolerance, you don’t want to place all three copies on the same rack.

So, what do we do? We could adopt a strategy where the three copies go on three different racks. But now we have a performance and network-bandwidth problem: when you write the replicas to three different racks, you haul the data across the network to three racks – and you do that for every file split. Keeping two replicas on the same rack has a network latency/bandwidth advantage, because less data is hauled between racks. So you adopt a second principle: place two replicas on the same rack. Can that rack fail too? If the first rack fails and then the second rack fails as well, all three copies are gone. But how likely is that? Far less likely. The name of the game is optimization, and you have to take some risk.

But wait a minute – the second rack failing after the first rack failure is unlikely. But how about a data node on the second rack failing? Well, that probability is higher. Right? And so, you adopt the third principle – two replicas on the same rack but not on the same data node!

So, there you have the rack awareness principle used by HDFS. In a formal language, they put it like this:

  • No more than one replica is placed on any one node.
  • No more than two replicas are placed on the same rack.
  • The number of racks used for block replication is always smaller than the number of replicas (a toy sketch of this placement policy follows below).
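Here is a toy Python sketch of a placement that satisfies the three rules above (illustrative only – the rack and node names are made up, and this is not the actual HDFS block placement code):

import random

# Toy replica placement following the three rules above: one replica per node,
# at most two replicas per rack, and fewer racks than replicas.
def place_replicas(racks):
    """racks: dict of rack name -> list of datanode names. Returns three (rack, node) pairs."""
    rack_names = list(racks)
    first_rack = random.choice(rack_names)                            # replica 1: any rack
    second_rack = random.choice([r for r in rack_names if r != first_rack])

    first_node = random.choice(racks[first_rack])
    second_node, third_node = random.sample(racks[second_rack], 2)    # same rack, different nodes

    return [(first_rack, first_node), (second_rack, second_node), (second_rack, third_node)]

cluster = {
    "rack-1": ["dn-1", "dn-2", "dn-3"],
    "rack-2": ["dn-4", "dn-5", "dn-6"],
    "rack-3": ["dn-7", "dn-8", "dn-9"],
}
print(place_replicas(cluster))
# e.g. [('rack-1', 'dn-2'), ('rack-3', 'dn-7'), ('rack-3', 'dn-9')]
# -> two racks for three replicas, and no node holds more than one replica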

Example of rack awareness in play in the picture below:

Source: Internet


The key tenets of HDFS: Low-cost, Fault-tolerant, Scalable

Low-cost: For obvious reasons. As you can see, we are talking about inexpensive hardware – basically, a low-cost compute cluster.

Source: Internet

Fault-tolerant: As you can see, HDFS has redundant copies to support fault-tolerance.

Source: Internet


Scalable: A simple primer first. Scaling up means adding more resources, like hard drives and memory, to increase the capacity of existing physical servers. Scaling out means adding more servers/machines to your architecture to spread the workload across more machines. Armed with this primer, do you see how HDFS lets you scale out rather than forcing you to scale up? All you need to do is add more machines to the cluster; the HDFS subsystem takes advantage of them and offers increased capacity.

Source: Internet


Finally, HDFS uses sequential read and write within each split/block to make the read/write faster.


Source: Internet

After we’ve stored a large amount of data in HDFS (a distributed storage system spread over an expandable cluster of individual slave nodes), the next question is: “How can I take advantage of a cluster of inexpensive compute and storage to analyze or query the data faster?”

First and foremost, having all the data travel to a central machine (the NameNode?) for processing would take a lot of time.

Next, any centralized processing would mean you are underutilizing the datanodes’ compute power and overusing the compute of a centralized machine.

Finally, centralized processing largely implies processing the data sequentially, and that would take a lot of time.

Let’s see how Hadoop solved that issue:

First and foremost, having all the data travel to a central machine for processing would take a lot of time. Hadoop solves this by taking the code to where the data resides. “Run the processing code where the data resides” is the mantra.


Next, centralized processing would underutilize the datanodes’ compute power while overloading the central machine. Hadoop solves this by using the compute resources of the datanodes themselves, as is evident in the point above.

Finally, centralized processing would mean processing the data sequentially, which would take a lot of time. Hadoop solves this with an architecture focused on parallel processing – the code running on the datanodes can, and will, run in parallel.

Thinking and working in Parallel:



Before we delve into an actual case, I would like to take a pause here and shine light on the genesis, history, and virtues of parallel programming in the context of Hadoop or big data processing framework.


It helps to remember that Hadoop and Spark are about processing large amounts of data at low cost while meeting business requirements such as speed. Hadoop took the first stab at it, and the problem had two parts. The first was the storage issue, solved with three fundamental constructs: 1. divide and conquer – split the data into smaller chunks and conquer the challenge of storing them; 2. a cluster computing framework – a cluster of commodity compute resources, as opposed to expensive servers, for storage; 3. scaling out as opposed to scaling up.

The second part was the compute challenge, solved using patterns suited to parallel processing – map, filter, and reduce – which have existed since the days of LISP and functional programming. It may help to recall that functional programming has always lent itself to parallel programming.

If you don’t remember what functional programming is all about, now is the time for a quick refresher.

Functional programming is about pure functions – functions without side effects. What is a pure function? Simply put, a pure function is a function that is deterministic and doesn’t produce side effects.

Source: Internet


Source: https://blog.jenkster.com/2015/12/what-is-functional-programming.html

What is a 'Pure Function'?

A function is called 'pure' if all its inputs are declared as inputs - none of them are hidden - and likewise all its outputs are declared as outputs.

In contrast, if it has hidden inputs or outputs, it's 'impure', and the contract we think the function offers is only half the story. The iceberg of complexity looms. We can never use impure code "in isolation". We can never test it in isolation. It always depends on other things which we have to keep track of whenever we want to test or debug.


What is 'Functional Programming'?

Functional programming is about writing pure functions, about removing hidden inputs and outputs as far as we can, so that as much of our code as possible just describes a relationship between inputs and outputs.

We accept that some side-effects are inevitable - most programs are run for what they do rather than what they return, but within our program we will exercise tight control. We will eliminate side-effects (and side-causes) wherever we can, and tightly control them whenever we can't.

Or put another way: Let's not hide what a piece of code needs, nor what results it will yield. If a piece of code needs something to run correctly, let it say so. If it does something useful, let it declare it as an output. When we do this, our code will be clearer. Complexity will come to the surface, where we can break it down and deal with it.


Characteristic #1: Pure functions have to be deterministic

A deterministic function is one that, for the same input x, always produces the same output y. It never changes its mind on the same input.

Examples of nondeterministic functions:

Source:

https://blog.bitsrc.io/functional-programming-part-2-pure-functions-85491f3d7190

const random = () => Math.random()

const getDate = () => Date.now()

const getUsers = () => fetch('/users')   // users might have changed, the network might be down, the server might be unavailable, etc.

Characteristic #2: A pure function must have no side effects

Where a side effect can be either an: External dependency (access to external variables, I/O streams, reading/writing files or making http calls). Or a mutation, (mutations on local/external variables or on passed reference arguments).

With side effect (Impure):

let min = 60

const isLessThanMin = value => value < min        

Pure:

const isLessThanMin = (min, value) => value < min

Why impure? external dependency.

Solution: dependency injection, as you can see above – pass min in as an argument.

With side effect (Impure):

const squares = (nums) => {
  for (let i = 0; i < nums.length; i++) nums[i] **= 2;
}

Pure:

const squares = (nums) => nums.map(num => num * num)        

Why impure? The function mutates the original array that was passed by reference.

Solution: use the functional .map instead, which creates a new array.

Impure:

const updateUserAge = (user, age) => { user.age = age }

Pure:

const updateUserAge = (user, age) => ({ ...user, age })        

Why impure? mutation on passed user reference

Solution: refrain from mutating passed object, instead return a new object

Here is some more explanation that makes for good reading:

https://blog.jenkster.com/2015/12/what-is-functional-programming.html

But you get the point, right?

To quote the same blog: why are pure functions good?

Development side:

  • Predictability: eliminating external factors and changes of environment makes functions saner and more predictable.
  • Maintainability: more predictable functions are easier to reason about, which reduces assumptions about state and developers’ cognitive load.
  • Composability: functions are independent and communicate only through inputs and outputs, which lets us compose them easily.
  • Testability: the self-containment and independence of functions take testability to the moon.

In terms of Application Performance:

  • Cache-ability (memoization): determinism means we can predict the output from the input (since each input has a defined output), so function results can be cached based on their inputs.
  • Parallelize-ability: since functions are side-effect free and independent, they can easily be parallelized (see the short sketch after this list).
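A tiny Python illustration of those last two bullets (my own example, not from the blog quoted above), using functools.lru_cache for memoization and multiprocessing.Pool to spread a pure function over worker processes:

from functools import lru_cache
from multiprocessing import Pool

# A pure function: same input -> same output, no side effects.
def square(x):
    return x * x

# Cache-ability: because square() is deterministic, its results can safely be memoized.
memoized_square = lru_cache(maxsize=None)(square)

# Parallelize-ability: because square() has no side effects, independent worker
# processes can evaluate it over the splits of the input in any order.
if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(square, range(10)))   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]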

We are not concerned here with all the other great benefits, but we are definitely interested in one. Can you guess? Yes, you got it right (if I read your mind correctly, that is) – parallelize-ability.

How are you going to write reliable code that works on the splits of the large data that we have spread across the cluster? The answer is: using the constructs of functional programming. And that is the next major construct of Hadoop and Spark.

Let’s lay out the picture once more. For a quick but affordable way of storing and processing very large amounts of data, we have already taken the first step: splitting the data and storing it across a cluster of computers. Now we want to take the compute to the data, and that is the context for doing things in parallel. You split the data into smaller chunks, and then you process those chunks. So you want stateless, pure functions that can safely work across hundreds or thousands of splits of the same large data set in parallel. Time again for a big round of applause for functional programming – bring in the pure functions to process these splits in parallel.

Let’s look at three key functional programming constructs heavily leveraged in Hadoop:

Source:

https://towardsdatascience.com/pythons-map-filter-and-reduce-functions-explained-2b3817a94639


A clarification - higher-order functions are functions that take another function as input, making them powerful general purpose expressions. In mathematical terms, one could think of a derivative d/dx, taking a function f as input.


Map( )

Being a higher-order function, the map function takes another function and an iterable (e.g., a list, set, or tuple) as input, applies the function to each element of the iterable, and returns the output. Its syntax is: map(function, iterable). For those mathematically inclined, it may be convenient to think of it as a mapping from a domain X to a domain Y:

f: X → Y, or alternatively: f(x) = y, ∀x ∈ X  [∀ = for all, ∈ = member of; i.e., f(x) = y for all x that are members of X]

For inspiring visualization, you could say:

map(cook, [ingredient1, ingredient2, ingredient3, ingredient4]) → [dish1, dish2, dish3, dish4]

Here is how this MIT note describes map():

Map

Map applies a unary function to each element in the sequence and returns a new sequence containing the results, in the same order:

map : (E → F) × Seq<E> → Seq<F>

For example, in Python:

from math import sqrt

map(sqrt, [1, 4, 9, 16])        # ==> [1.0, 2.0, 3.0, 4.0]

map(str.lower, ['A', 'b', 'C']) # ==> ['a', 'b', 'c']        

map is built-in, but it is also straightforward to implement in Python:

def map(f, seq):
    result = []
    for elt in seq:
        result.append(f(elt))
    return result

This operation captures a common pattern for operating over sequences: doing the same thing to each element of the sequence.

Filter()

Similar to map(), the filter() higher-order function takes a function and an iterable as inputs. The function in this case needs to be Boolean in nature, returning True/False according to the filter condition. As output, filter() returns the subset of the input data that meets the condition stipulated by the function. Mathematically, a filter operation might be loosely specified as follows:

f: X → X′, with X′ ⊆ X  [⊆ = subset of]

filter(non_vegan, [dish1, dish2, dish3, dish4]) → [dish2, dish4]

Here is how the same MIT note describes filter():

Filter

Our next important sequence operation is filter, which tests each element with a unary predicate. Elements that satisfy the predicate are kept; those that don’t are removed. A new list is returned; filter doesn’t modify its input list.

filter : (E → boolean) × Seq<E> → Seq<E>

Python examples:

filter(str.isalpha, ['x', 'y', '2', '3', 'a']) # ==> ['x', 'y', 'a']

def isOdd(x): return x % 2 == 1

filter(isOdd, [1, 2, 3, 4]) # ==> [1,3]

filter(lambda s: len(s)>0, ['abc', '', 'd']) # ==> ['abc', 'd']        

We can define filter in a straightforward way:

def filter(f, seq):
    result = []
    for elt in seq:
        if f(elt):
            result.append(elt)
    return result

Reduce

reduce(), yet another higher-order function, returns a single value (i.e., it reduces the input to a single element). Commonly, this would be something like the sum of all elements in a list.

The mathematical representation: f: Seq<X> → Y – a whole sequence collapses into a single value.

reduce(mix, [ingredient1, ingredient2, ingredient3, ingredient4]) → dish


This medium article might make it clearer:

https://medium.com/@samims/pythons-map-reduce-and-filter-the-magic-of-functional-programming-795859846870

Reduce

The reduce function takes a function and an iterable as input and returns a single value. The function is applied to the first two elements of the iterable, the result is then combined with the next element, and so on. This process continues until all the elements of the iterable have been processed. For example, the following code uses the reduce function to sum all the numbers in a list:

# Without reduce: using a for loop
numbers = [1, 2, 3, 4, 5]
sum_numbers = 0
for num in numbers:
    sum_numbers += num
print(sum_numbers)  # Output: 15

# With the reduce function
from functools import reduce
numbers = [1, 2, 3, 4, 5]
sum_numbers = reduce(lambda x, y: x + y, numbers)
print(sum_numbers)  # Output: 15

Hopefully, the section above gives you enough of a primer to deploy the concepts of map and reduce in the world of Hadoop with a clear big-picture view of the whys.


Map & Reduce – the traditional way

Source: Internet

Here is a Quora comment that I liked:

“Single cook cooking an entree is regular computing. Hadoop is multiple cooks cooking an entree into pieces and letting each cook cook her piece. Each cook has a separate stove and a food shelf. The first cook cooks the meat, the second cook cooks the sauce. This phase is called "Map". At the end the main cook assembles the complete entree. This is called "Reduce".

For Hadoop the cooks are not allowed to keep things on the stove between operations. Each time you make a particular operation, the cook puts results on the shelf. This slows things down. For Spark the cooks are allowed to keep things on the stove between operations. This speeds things up.”

Source: https://www.quora.com/How-does-Apache-Spark-and-Apache-Hive-work-together

With this background in mind, let’s dive in.


Let’s say you want to do something simple, like count the number of flights for each carrier in our flight data set — this will be our example scenario for this section. Let’s assume we have a sample data set containing data about completed flights within the United States between 1987 and 2008. We have one large file for each year, and within every file, each individual line represents a single flight. In other words, one line represents one record.

Now, remember that for a normal program that runs serially, this is a simple operation. Listing below shows the pseudocode, which is fairly straightforward: set up the array to store the number of times you run across each carrier, and then, as you read each record in sequence, increment the applicable airline’s counter.


Pseudocode for Calculating The Number of Flights By Carrier Serially
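Here is a minimal Python sketch along the lines of that serial pseudocode (the sample records and the position of the carrier code in each line are assumptions about the data layout):

# Serial version: one process reads every record and keeps a running tally.
from collections import defaultdict

sample_records = [                       # in the real data set, these lines would come from one large file per year
    "1987,10,14,3,741,730,912,849,AA,1451",
    "1987,10,15,4,729,730,903,849,UA,1451",
    "1987,10,17,6,741,730,918,849,AA,1451",
]

flight_counts = defaultdict(int)         # carrier code -> number of flights seen so far
for record in sample_records:            # read each record in sequence
    carrier = record.split(",")[8]       # assumed column holding the carrier code
    flight_counts[carrier] += 1          # increment the applicable airline's counter

print(dict(flight_counts))               # {'AA': 2, 'UA': 1}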


Let’s revisit the problem statement once again.

The hard problem at hand: We have the ingredients, but we need to cook them to feed our hunger.

Deal with a gigantic amount of data – store it, analyze it, and find actionable insights. Do it at scale, at an affordable cost, and at speed, to be effective.

The previous pseudocode gives the sequential approach, and we know what sequentialism does NOT give us: SPEED.

And that takes us to the next aspect of HADOOP. Welcome to the world of parallel programming powered by functional programming. Welcome to the constructs of functional programming like MAP/FILTER & REDUCE and so on.


Source: Internet

We will begin by looking at the same problem again with a parallel-processing mindset.

Pseudocode for Calculating The Number of Flights By Carrier in Parallel

Source: “Hadoop for Dummies”

The listing above shows a completely different way of thinking about how to process data. Since we need totals, we had to break the application up into phases.


Map phase:

The first phase is the map phase where, at a high level, the file is broken into multiple pieces (aka input splits) and mapper tasks on the data nodes start working on their own piece of the file.


The mapper task itself processes its input split one record at a time — this lone record is represented by the key/value pair (K1,V1). In the case of our flight data, the assumption is that each row in the text file is a single record. For each record, the text of the row itself represents the value, and the byte offset of each row from the beginning of the split is considered to be the key.


The mapper treats every record in the data set individually. Here, it extracts the carrier code from the flight record it is assigned and then emits a key/value pair, with the carrier code as the key and the integer one as the value. The map operation runs against every record in the data set. After every record is processed, all the values (the ones) need to be grouped together for each key – the airline carrier code – and then sorted by key. In summary, as the mapper task processes each record, it generates a new key/value pair, (K2,V2). The key and the value here can be completely different from the input pair. The output of the mapper task is the full collection of all these key/value pairs, represented in our case by list(K2,V2).

Source: “Hadoop for Dummies”
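Here is a minimal Python sketch of a mapper along these lines (the comma-delimited layout and the column holding the carrier code are assumptions; real Hadoop mappers are typically written in Java against the MapReduce API or run via Hadoop Streaming):

# Toy mapper: consumes (byte_offset, record_text) pairs, i.e. (K1, V1),
# and emits (carrier_code, 1) pairs, i.e. (K2, V2).
def flight_mapper(key, value):
    fields = value.split(",")
    carrier = fields[8]                  # assumed column holding the carrier code
    yield carrier, 1

records = [
    (0,  "1987,10,14,3,741,730,912,849,AA,1451"),
    (60, "1987,10,15,4,729,730,903,849,UA,1451"),
]
mapped = [pair for offset, line in records for pair in flight_mapper(offset, line)]
print(mapped)   # [('AA', 1), ('UA', 1)]  -> list(K2, V2)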

Shuffle phase

After the Map phase and before the beginning of the Reduce phase is a handoff process, known as shuffle and sort. Here, data from the mapper tasks is prepared and moved to the nodes where the reducer tasks will be run. When the mapper task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to disk. You can see this concept in the picture below, which shows the MapReduce data processing flow and its interaction with the physical components of the Hadoop cluster.
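A toy illustration of what shuffle and sort accomplish for this job – grouping every value emitted for the same key and ordering the keys – leaving out the real machinery of partitioning, spilling to disk, and moving data between nodes:

from collections import defaultdict

# Toy shuffle and sort: group every (K2, V2) pair from the mappers by key,
# so each reducer receives (K2, list(V2)) with the keys in sorted order.
def shuffle_and_sort(mapped_pairs):
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

mapped = [("AA", 1), ("UA", 1), ("AA", 1), ("DL", 1)]
print(shuffle_and_sort(mapped))   # [('AA', [1, 1]), ('DL', [1]), ('UA', [1])]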





Reduce phase

Here’s the blow-by-blow so far: A large data set has been broken down into smaller pieces, called input splits, and individual instances of mapper tasks have processed each one of them. In some cases, this single phase of processing is all that’s needed to generate the desired application output. For example, if you’re running a basic transformation operation on the data — converting all text to uppercase, for example, or extracting key frames from video files — the lone phase is all you need. (This is known as a map-only job, by the way.) But in many other cases, the job is only half-done when the mapper tasks have written their output. The remaining task is boiling down all interim results to a single, unified answer.

The Reduce phase processes the keys and their individual lists of values so that what’s normally returned to the client application is a set of key/value pairs. Similar to the mapper task, which processes each record one-by-one, the reducer processes each key individually. Back in earlier picture, you see this concept represented as K2,list(V2). The whole Reduce phase returns list(K3,V3). Normally, the reducer returns a single key/value pair for every key it processes. However, these key/value pairs can be as expansive or as small as you need them to be. In the code example later, you see a minimalist case, with a simple key/value pair with one airline code and the corresponding total number of flights completed. But in practice, you could expand the sample to return a nested set of values where, for example, you return a breakdown of the number of flights per month for every airline code.
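Continuing the same toy example, here is a sketch of the reducer for the flight-count job: it receives each carrier code together with its list of ones, K2,list(V2), and returns a single (carrier, total) pair, so the whole phase yields list(K3,V3):

# Toy reducer: consumes (K2, list(V2)) and emits one (K3, V3) pair per key.
def flight_reducer(carrier, counts):
    return carrier, sum(counts)          # carrier code and its total number of flights

shuffled = [("AA", [1, 1]), ("DL", [1]), ("UA", [1])]
reduced = [flight_reducer(carrier, counts) for carrier, counts in shuffled]
print(reduced)   # [('AA', 2), ('DL', 1), ('UA', 1)]  -> list(K3, V3)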



Finally, a brief summary:

Distributed computing, divide and conquer, and parallelism are the keys to big data platforms.

Distributed computing:

  • Use a cluster computing framework – a cluster of inexpensive compute nodes

Divide & conquer:

  • Store large amounts of data on a cluster of commodity hardware using HDFS
  • Achieve cost and scale – large amounts of data on inexpensive hardware

Parallelism:

  • Deploy functional programming constructs (map/reduce, filter/reduce, …) on the cluster
  • Achieve speed – process large amounts of data in parallel
  • Achieve cost, achieve scale


References:

Map & Reduce – the Hadoop way, Source: Internet


