WHAT ARE THE 7 V’S OF BIG DATA?

“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

Big Data is of three types:

·        Structured Data – Data that can be stored, processed, and retrieved in a fixed format; in other words, the data conforms to a predefined layout or schema when it is stored and extracted.

·        Unstructured Data – This type of data lacks any structure and is stored as it is. Analyzing such data is very time-consuming as well as challenging.

·        Semi-structured Data – This type of data is a mix of the above two types, i.e., structured and unstructured data. It is also known as hybrid big data.

The sheer volume, variety and velocity of information nowadays make it indispensable to capture, store and analyze this complex mass of data. That is why Big Data is characterized by five Vs (with two more discussed further below):

  •  Volume

Data stored in company repositories has gone from taking megabytes to gigabytes and then petabytes of space. Ninety percent of existing information was created in the last two years. To give you an idea: in 2008, Google was processing over 20 petabytes of data per day!

It was estimated that by 2020 some 40 zettabytes of data would be processed across the world, with the amount of data doubling roughly every two years. A major contributor to this data volume is the Internet of Things (IoT), which collects an immense amount of information via sensors.

  •  Velocity

The velocity of data movement, processing and capture within and outside companies has increased significantly. Models based on business intelligence normally take days to be processed, while today’s analytical needs require that data be captured and processed “practically” in real time thanks to the high-speed flow of data.

Data velocity in almost real time derives from the ubiquity and availability of devices connected to the internet, both wireless and wired. Information is currently transmitted at extraordinary speed. For example, it is estimated that 500 hours of video are uploaded to YouTube per minute and that 200 million emails are sent in the same period of time.

  •  Variety

Data diversity has burgeoned, going from structured data kept in business databases to unstructured data, semi-structured data and data in many formats (audio, video, XML, etc.). For example, over 3.5 million people make calls, send SMS, tweet and browse the internet from their cell phones.

Estimations indicate that 90% of today’s data are generated in an unstructured manner. Moreover, not every analysis method can be applied to every kind of data; consequently, these methods must be adjusted to the nature of the information.

  •  Veracity

Veracity refers to the trustworthiness of data: the aim is to ensure that the information we retrieve is reliable. Accurate data allow for greater utilization because of their quality. This is particularly important for organizations whose business is centered on information.

However, given the existing amount of information, some people believe that veracity is a secondary characteristic of Big Data.

  •  Value

Value is the return that results from data management. The key to Big Data is not the sheer amount of information but rather how it is used and handled. Even though it is very expensive to implement IT infrastructures that handle large volumes of data, this investment may offer companies major competitive advantages.

A common reference when speaking of Big Data’s value is the number of people connected to the internet around the world – some 3.149 billion hyper-connected users – a pool of data whose return in many sectors is still to be estimated.

Two additional Vs

In addition to the Vs mentioned above, some experts argue that two further aspects – variability and visualization – should be added to the five Vs:

Variability refers to variability in meaning. This is important when you analyze perceptions. Algorithms must be able to understand the context and decode the exact meaning of every word in its specific environment. This is a much more complex analysis.

Visualization means making the collected and analyzed data understandable and easy to read. Without the right visualization, it is impossible to maximize and leverage raw data.



Advantages of Big Data        

Big Data can help create pioneering breakthroughs for organizations that know how to use it correctly. Big Data solutions and Big Data Analytics can not only foster data-driven decision making, but they also empower your workforce in ways that add value to your business.

1. Cost optimization

One of the most significant benefits of Big Data tools like Hadoop and Spark is that they offer cost advantages to businesses when it comes to storing, processing, and analyzing large amounts of data. Beyond that, Big Data tools can also identify efficient and cost-effective ways of doing business.

The logistics industry presents an excellent example of the cost-reduction benefit of Big Data. Usually, the cost of product returns is 1.5 times greater than that of actual shipping costs. Big Data Analytics allows companies to minimize product return costs by predicting the likelihood of product returns. They can estimate which products are most likely to be returned, thereby allowing companies to take suitable measures to reduce losses on returns.

2. Improve efficiency

 Big Data tools can improve operational efficiency by leaps and bounds. By interacting with customers/clients and gaining their valuable feedback, Big Data tools can amass large amounts of useful customer data. This data can then be analyzed and interpreted to extract meaningful patterns hidden within (customer taste and preferences, pain points, buying behavior, etc.), which allows companies to create personalized products/services. 

 Big Data Analytics can identify and analyze the latest market trends, allowing you to keep pace with your competitors in the market. Another benefit of Big Data tools is that they can automate routine processes and tasks. This frees up the valuable time of human employees, which they can devote to tasks that require cognitive skills.  

3. Foster competitive pricing

 Big Data Analytics facilitates real-time monitoring of the market and your competitors. You can not only keep track of the past actions of your competitors but also see what strategies they are adopting now. Big Data Analytics offers real-time insights that allow you to – 

  • Calculate and measure the impact of price changes.
  • Implement competitive positioning for maximizing company profits. 
  • Evaluate finances to get a clearer idea of the financial position of your business.
  • Implement pricing strategies based on local customer demands, customer purchasing behavior, and competitive market patterns.
  • Automate the pricing process of your business to maintain price consistency and eliminate manual errors. 

4. Boost sales and retain customer loyalty


 Big Data aims to gather and analyze vast volumes of customer data. The digital footprints that customers leave behind reveal a great deal about their preferences, needs, buying behavior, and much more. This customer data offers the scope to design tailor-made products and services to cater to the specific needs of individual customer segments. The higher the personalization quotient of a business, the more it will attract customers. Naturally, this will boost sales considerably. 

 Personalization and the quality of product/service also have a positive impact on customer loyalty. If you offer quality products at competitive prices along with personalized features/discounts, customers will keep coming back to you time and again. 

5. Innovate 

 Big Data Analytics and tools can dig into vast datasets to extract valuable insights, which can be transformed into actionable business strategies and decisions. These insights are the key to innovation. 

The insights you gain can be used to tweak business strategies, develop new products/services (that can address specific problems of customers), improve marketing techniques, optimize customer service, improve employee productivity, and find radical ways to expand brand outreach. 

6. Focus on the local environment

This is particularly relevant for small businesses that cater to the local market and its customers. Even if your business functions within a constrained setting, it is essential to understand your competitors, what they are offering, and the customers you serve.

 Big Data tools can scan and analyze the local market and offer insights that allow you to see the local trends associated with sellers and customers. Consequently, you can leverage such insights to gain a competitive edge in the local market by delivering highly personalized products/services within your niche, local environment. 

7. Control and monitor online reputation

 As an increasing number of businesses are shifting towards the online domain, it has become increasingly crucial for companies to check, monitor, and improve their online reputation. After all, what customers are saying about you on various online and social media platforms can affect how your potential customers will view your brand. 

 There are numerous Big Data tools explicitly designed for sentiment analysis. These tools help you surf the vast online sphere to find out and understand what people are saying about your products/services and your brand. When you are able to understand customer grievances, only then can you work to improve your services, which will ultimately improve your online reputation. 

 To conclude, Big Data has emerged as a highly powerful tool for businesses, irrespective of their size, and the industry they are a part of. The biggest advantage of Big Data is the fact that it opens up new possibilities for organizations. Improved operational efficiency, improved customer satisfaction, drive for innovation, and maximizing profits are only a few among the many, many benefits of Big Data. Despite the proven benefits of Big Data we’ve witnessed so far, it still holds numerous untapped possibilities that are waiting to be explored. 

Big Data Tools and Technologies

Hadoop Ecosystem

You can’t possibly talk about Big Data without mentioning the elephant in the room (pun intended!) – Hadoop. Although it is sometimes expanded as “High-Availability Distributed Object-Oriented Platform”, the name actually comes from a toy elephant belonging to the son of co-creator Doug Cutting. Hadoop is essentially a framework for storing and processing large datasets in a distributed, fault-tolerant way. Over the years, it has grown into an entire ecosystem of related tools, and most commercial Big Data solutions are built on top of it.


HDFS

It stands for Hadoop Distributed Filesystem. It can be thought of as the file storage system for Hadoop. HDFS deals with distribution and storage of large datasets.

MapReduce

MapReduce allows massive datasets to be processed rapidly in parallel. It follows a simple idea – to deal with a lot of data in very little time, simply employ more workers for the job. A typical MapReduce job is processed in two phases: Map and Reduce. The “Map” phase sends a query for processing to various nodes in a Hadoop cluster, and the “Reduce” phase collects all the results and combines them into a single output. MapReduce also takes care of scheduling jobs, monitoring them, and re-executing failed tasks.
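To make the two phases concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets the mapper and reducer be plain scripts that read from stdin and write tab-separated key-value pairs to stdout. The file name and the map/reduce dispatch convention are illustrative choices, not part of Hadoop itself.

```python
#!/usr/bin/env python3
"""Minimal word-count sketch for Hadoop Streaming (illustrative).

Run as mapper:  python3 wordcount.py map
Run as reducer: python3 wordcount.py reduce
Both read stdin and write tab-separated key-value pairs to stdout,
which is the contract Hadoop Streaming expects.
"""
import sys


def mapper():
    # Map phase: emit ("word", 1) for every word in the input split.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # Reduce phase: input arrives sorted by key, so counts for the
    # same word are adjacent and can be summed with a running total.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")


if __name__ == "__main__":
    reducer() if len(sys.argv) > 1 and sys.argv[1] == "reduce" else mapper()
```

A typical (illustrative) launch passes this script as both the -mapper and -reducer arguments to the hadoop-streaming JAR shipped with your distribution, together with -input and -output paths on HDFS.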

Hive

Hive is a data warehousing tool that converts SQL-like queries into MapReduce jobs. It was originally developed at Facebook. The best part about using Hive is that developers can reuse their existing SQL knowledge, since Hive uses HQL (Hive Query Language), whose syntax is similar to classic SQL.


Spark

Apache Spark deserves a special mention on this list as it is one of the fastest engines for Big Data processing. It’s put to use by major players including Amazon, Yahoo!, eBay, and Flipkart. Take a look at all the organisations that are powered by Spark, and you will be blown away!

Spark has in many ways overtaken Hadoop’s MapReduce, as it can run programs up to a hundred times faster in memory and ten times faster on disk.

It complements the goals with which Hadoop was introduced. When dealing with large datasets, one of the major concerns is processing speed, so there was a need to reduce the waiting time between the execution of each query. Spark does exactly that, thanks to its built-in modules for streaming, graph processing, machine learning, and SQL. It also supports the most common programming languages for data work – Java, Python, and Scala.

The main motive behind introducing Spark was to speed up the computational side of Hadoop. However, it should not be seen as an extension of the latter. In fact, Spark typically uses Hadoop for two purposes only – storage (HDFS) and cluster resource management (YARN) – while the processing itself runs on Spark’s own engine. Other than that, it’s a fairly standalone tool.
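As a rough illustration of the in-memory approach, here is a small PySpark sketch of the same word-count idea. The HDFS paths and the application name are placeholders, not paths from any real cluster.

```python
from pyspark.sql import SparkSession

# Illustrative only: the input/output paths and app name are placeholders.
spark = (SparkSession.builder
         .appName("wordcount-sketch")
         .getOrCreate())

lines = spark.sparkContext.textFile("hdfs:///data/books")   # read from HDFS

counts = (lines.flatMap(lambda line: line.split())   # "map": one record per word
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # "reduce": sum per key

# cache() keeps the result in memory, so both actions below reuse it
# instead of recomputing the whole lineage from disk.
counts.cache()
print(counts.count())                                 # number of distinct words
counts.saveAsTextFile("hdfs:///data/wordcount-out")

spark.stop()
```

Because counts is cached, the count() action and the save reuse the in-memory result rather than recomputing everything from disk – which is where much of Spark’s speed advantage comes from.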

Hadoop consists of three core components –

  • Hadoop Distributed File System (HDFS) – It is the storage layer of Hadoop.
  • Map-Reduce – It is the data processing layer of Hadoop.
  • YARN – It is the resource management layer of Hadoop.

Core Components of Hadoop

Let us understand these Hadoop components in detail.

1. HDFS

HDFS, short for Hadoop Distributed File System, provides distributed storage for Hadoop. It has a master-slave topology.


The master is a high-end machine, whereas the slaves are inexpensive commodity computers. Large data files get divided into a number of blocks (128 MB each by default), and Hadoop stores these blocks in a distributed fashion across the cluster of slave nodes. The master stores only the metadata.
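For a sense of how this looks in practice, the sketch below drives the standard hdfs dfs shell commands from Python (kept in Python for consistency with the other examples here). The paths and the replication factor are illustrative.

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` sub-command and raise if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Create a directory and upload a local file. HDFS transparently splits
# the file into blocks (128 MB by default) and replicates each block
# across DataNodes; the NameNode records only the metadata.
hdfs("-mkdir", "-p", "/data/logs")
hdfs("-put", "-f", "local_events.log", "/data/logs/events.log")

# Request a replication factor of 3 for this file (illustrative).
hdfs("-setrep", "3", "/data/logs/events.log")

# List the directory to confirm the upload.
hdfs("-ls", "/data/logs")
```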

HDFS runs two daemons. They are:

NameNode: The NameNode performs the following functions –

  • NameNode Daemon runs on the master machine.
  • It is responsible for maintaining, monitoring and managing DataNodes.
  • It records the metadata of the files, such as block locations, file size, permissions and hierarchy.
  • The NameNode records all changes to the metadata – such as creation, deletion and renaming of files – in edit logs.
  • It regularly receives heartbeat and block reports from the DataNodes.

DataNode: The various functions of DataNode are as follows –

  • DataNode runs on the slave machine.
  • It stores the actual business data.
  • It serves read-write requests from users.
  • The DataNode does the groundwork of creating, replicating and deleting blocks on the instruction of the NameNode.
  • Every 3 seconds, by default, it sends a heartbeat to the NameNode reporting on its health.


2. MapReduce 

It is the data processing layer of Hadoop. It processes data in two phases.

They are:-

Map Phase- This phase applies business logic to the data. The input data gets converted into key-value pairs.

Reduce Phase- The Reduce phase takes as input the output of Map Phase. It applies aggregation based on the key of the key-value pairs.


Map-Reduce works in the following way (a small in-memory simulation of these steps follows the list):

  • The client specifies the input file for the Map function; the framework splits it into tuples (input records).
  • The Map function derives a key and a value from each input record; its output is a set of key-value pairs.
  • The MapReduce framework sorts the key-value pairs produced by the Map function.
  • The framework merges the tuples having the same key together.
  • The reducers get these merged key-value pairs as input.
  • The reducer applies aggregate functions to each group of key-value pairs.
  • The output from the reducer gets written to HDFS.
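The tiny, plain-Python simulation below mirrors those steps in memory – it is not Hadoop code, just an illustration of map, sort/shuffle, and reduce on a toy input.

```python
from itertools import groupby
from operator import itemgetter

records = ["big data", "big results", "data pipelines"]

# Map: turn each input record into (key, value) pairs.
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle/sort: the framework sorts pairs so equal keys sit next to each other.
mapped.sort(key=itemgetter(0))

# Reduce: apply an aggregate (here, a sum) to each group of values.
reduced = {key: sum(value for _, value in group)
           for key, group in groupby(mapped, key=itemgetter(0))}

print(reduced)   # {'big': 2, 'data': 2, 'pipelines': 1, 'results': 1}
```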

Features of MapReduce

  • Simplicity – MapReduce jobs are easy to run. Applications can be written in almost any language, such as Java, C++, and Python.
  • Scalability – MapReduce can process petabytes of data.
  • Speed – Through parallel processing, problems that would take days to solve can be handled by MapReduce in hours or minutes.
  • Fault Tolerance – MapReduce takes care of failures. If one copy of the data is unavailable, another machine holding a replica can be used to complete the same subtask.

3. YARN

YARN, short for Yet Another Resource Negotiator, has the following components:

Resource Manager

  • Resource Manager runs on the master node.
  • It knows the location of the slaves (Rack Awareness).
  • It is aware of how many resources each slave has.
  • The Resource Scheduler is one of the important services run by the Resource Manager.
  • The Resource Scheduler decides how resources get assigned to the various tasks.
  • The Application Manager is another service run by the Resource Manager.
  • The Application Manager negotiates the first container for an application.
  • The Resource Manager keeps track of the heartbeats from the Node Managers.

Node Manager

  • It runs on slave machines.
  • It manages containers; a container is nothing but a fraction of the Node Manager’s resource capacity.
  • The Node Manager monitors the resource utilization of each container.
  • It sends heartbeats to the Resource Manager.

Job Submitter

The application startup process is as follows (a client-side submission sketch follows the list):

  • The client submits the job to Resource Manager.
  • The Resource Manager consults the Resource Scheduler and allocates a container.
  • The Resource Manager then contacts the relevant Node Manager to launch the container.
  • The container runs the Application Master.
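As a client-side illustration of that flow, the sketch below submits a Spark application to a YARN-managed cluster using the standard spark-submit and yarn command-line tools. The script name, executor count, and memory figures are placeholders.

```python
import subprocess

# Submit a Spark application with YARN as the resource manager. YARN
# allocates a container for the Application Master, which then negotiates
# further containers (executors) for the actual tasks.
subprocess.run([
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "cluster",     # Application Master runs inside YARN
    "--num-executors", "4",         # illustrative resource request
    "--executor-memory", "2g",
    "wordcount_job.py",             # hypothetical application script
], check=True)

# Ask the Resource Manager which applications are currently running.
subprocess.run(["yarn", "application", "-list"], check=True)
```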

The basic idea of YARN was to split resource management from job scheduling and monitoring. It has one global Resource Manager and a per-application Application Master. An application can be either a single job or a DAG of jobs.

The Resource Manager’s job is to assign resources to the various competing applications. The Node Manager runs on the slave nodes; it is responsible for containers, monitors resource utilization, and reports back to the Resource Manager.

The job of the Application Master is to negotiate resources from the Resource Manager. It also works with the Node Manager to execute and monitor the tasks.

Main features of YARN are:

  • Flexibility – Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming. Due to this feature of YARN, other applications can also be run along with Map Reduce programs in Hadoop2.
  • Efficiency – Because many applications run on the same cluster, the efficiency of Hadoop increases without much effect on quality of service.
  • Shared – Provides a stable, reliable, secure foundation and shared operational services across multiple workloads. Additional programming models, such as graph processing and iterative modeling, are now possible for data processing.


Hive

The Hadoop ecosystem component Apache Hive is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hive performs three main functions: data summarization, querying, and analysis.

Hive uses a language called HiveQL (HQL), which is similar to SQL. Hive automatically translates these SQL-like queries into MapReduce jobs that execute on Hadoop.


The main parts of Hive are listed below (a small query sketch follows the list):

  • Metastore – It stores the metadata.
  • Driver – Manages the lifecycle of a HiveQL statement.
  • Query compiler – Compiles HiveQL into a directed acyclic graph (DAG) of tasks.
  • Hive server – Provides a Thrift interface and a JDBC/ODBC server.
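A small sketch of querying Hive from Python through HiveServer2’s Thrift interface is shown below. It assumes the third-party PyHive client library and a hypothetical page_views table, so the connection details and table name are placeholders.

```python
from pyhive import hive  # third-party client for HiveServer2 (pip install pyhive)

# Connection details are placeholders for a real HiveServer2 endpoint.
conn = hive.Connection(host="hive-server.example.com", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# HiveQL looks like SQL; behind the scenes Hive compiles the query into
# a DAG of tasks that run on the cluster.
cursor.execute("""
    SELECT country, COUNT(*) AS views
    FROM page_views
    WHERE view_date >= '2024-01-01'
    GROUP BY country
    ORDER BY views DESC
    LIMIT 10
""")

for country, views in cursor.fetchall():
    print(country, views)

conn.close()
```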


