Big Data
In today's digital era, almost everyone uses some kind of networking platform, and these platforms generate enormous amounts of data every second. In 2020, an estimated 1.7 MB of data was created every second for every person on earth. It has become necessary to store and handle this huge volume of data. This article explains the problem of big data and outlines how large companies like Google, Facebook, and Microsoft manage their data.
What is big data?
>>Big data is a term for massive data sets with large, varied, and complex structures that are difficult to store, analyze, and visualize for further processing or results. This data can be generated from online transactions, audio, video, social media, and many other sources.
Defining big data via the three Vs:
1. Volume: Large volumes of digital data are generated continuously from millions of devices and applications (ICTs, smartphones, product codes, social networks, sensors, logs, etc.). The name Big Data itself refers to this enormous size. The size of the data plays a crucial role in determining the value that can be derived from it, and it is largely the volume that determines whether a particular data set can be considered big data at all. Hence, 'Volume' is one characteristic that must be considered when dealing with Big Data.
2. Velocity: Data is generated rapidly and must be processed just as rapidly to extract useful information and relevant insights. Velocity refers to the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices. The flow of data is massive and continuous.
3. Variety: Big data is generated from many distributed sources and in multiple formats (e.g., videos, documents, comments, logs). Large data sets consist of structured and unstructured data that may be public or private, local or remote, shared or confidential, complete or incomplete.
Types of big data
1. Structured data: Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data. Over time, computer science has had great success in developing techniques for working with this kind of data (where the format is known in advance) and deriving value from it. However, problems now arise as such data grows to enormous sizes, typically in the range of multiple zettabytes.
2. Unstructured data: Any data whose form or structure is unknown is classified as unstructured data. Besides being huge in size, unstructured data poses multiple challenges when it comes to processing it and deriving value from it. A typical example is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Organizations today have a wealth of data available to them, but they often do not know how to derive value from it because the data is in its raw, unstructured form.
3. Semi-structured data: Semi-structured data can contain both of the above forms of data. It appears structured, but it is not defined by, for example, a table definition in a relational DBMS. A typical example of semi-structured data is data represented in an XML file, as illustrated below.
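As a small illustration, the hypothetical XML record below carries its own tags instead of conforming to a fixed table schema, yet it can still be queried by element name. This is only a sketch; the field names and values are invented for the example, and it uses the standard Java DOM parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SemiStructuredExample {
    public static void main(String[] args) throws Exception {
        // A hypothetical semi-structured record: the tags describe the data,
        // but there is no fixed relational schema behind it.
        String xml =
            "<customer>"
            + "<name>Asha Patil</name>"
            + "<email>asha@example.com</email>"
            + "<order id=\"1001\"><item>Laptop</item></order>"
            + "</customer>";

        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Query fields by tag name rather than by table column.
        String name  = doc.getElementsByTagName("name").item(0).getTextContent();
        String email = doc.getElementsByTagName("email").item(0).getTextContent();
        System.out.println(name + " <" + email + ">");
    }
}
```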
There are many ways to handle the huge amounts of data generated by companies; one of them is described below:
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) follows a distributed file system design and runs on commodity hardware. Unlike many other distributed systems, HDFS is highly fault-tolerant and designed for low-cost hardware. HDFS holds very large amounts of data and provides easy access to it. To store such huge data, files are spread across multiple machines and stored redundantly to protect the system from data loss in case of failure. HDFS also makes data available to applications for parallel processing.
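To make this concrete, here is a minimal, hedged sketch of how an application might write and read a file through the HDFS Java client (the org.apache.hadoop.fs.FileSystem API). The namenode address hdfs://namenode:9000 and the path /user/demo/sample.txt are placeholders, and the hadoop-client library is assumed to be on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; use your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt"); // placeholder path

            // Write: the client streams data; HDFS splits it into blocks
            // and replicates them across datanodes behind the scenes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("hello from hdfs\n");
            }

            // Read: the client asks the namenode where the blocks live,
            // then pulls the bytes directly from the datanodes.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```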
HDFS follows a master-slave architecture and has the following elements.
>>Namenode: The namenode is commodity hardware running the GNU/Linux operating system and the namenode software. The system hosting the namenode acts as the master server and performs the following tasks:
- Manages the file system namespace.
- Regulates client’s access to files.
- It also executes file system operations such as renaming, closing, and opening files and directories; a small client-side sketch of these namespace operations follows this list.
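As a rough sketch of those namespace operations from a client's point of view, the code below creates a directory, renames a file, and lists a directory; each call is a metadata operation served by the namenode. The namenode URI and all paths are placeholder assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamenodeOpsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder

        try (FileSystem fs = FileSystem.get(conf)) {
            // Each of these is a namespace (metadata) operation handled by the namenode.
            fs.mkdirs(new Path("/user/demo/reports"));
            fs.rename(new Path("/user/demo/sample.txt"),
                      new Path("/user/demo/reports/sample.txt"));

            for (FileStatus status : fs.listStatus(new Path("/user/demo/reports"))) {
                System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
            }
        }
    }
}
```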
>>Datanode: The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system.
- Datanodes perform read-write operations on the file system, as per client requests.
- They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode; the replication sketch below shows one such namenode-directed change.
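As a hedged example of a namenode-directed datanode operation: when a client changes a file's replication factor, the namenode instructs datanodes to create or delete block replicas to match. The URI and path below are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt"); // placeholder

            // The client only updates metadata here; the namenode then tells
            // datanodes to add or remove block replicas in the background.
            boolean accepted = fs.setReplication(file, (short) 3);

            System.out.println("replication change accepted: " + accepted
                + ", target factor: " + fs.getFileStatus(file).getReplication());
        }
    }
}
```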
>>Block: User data is generally stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored on individual datanodes. These file segments are called blocks; in other words, a block is the minimum amount of data that HDFS reads or writes. The default block size is 64 MB (128 MB in newer Hadoop releases), and it can be increased as needed by changing the HDFS configuration.
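For illustration, the sketch below shows two ways a client might influence the block size: setting the dfs.blocksize property on its Configuration, or passing an explicit block size when creating a file. This is an assumption-laden example; the namenode URI, file path, and chosen sizes are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder

        // Option 1: set the default block size for files created with this config.
        long blockSize = 128L * 1024 * 1024; // 128 MB
        conf.setLong("dfs.blocksize", blockSize);

        try (FileSystem fs = FileSystem.get(conf)) {
            // Option 2: request an explicit block size for a single file.
            Path file = new Path("/user/demo/big-output.dat"); // placeholder
            short replication = 3;
            int bufferSize = 4096;
            try (FSDataOutputStream out =
                     fs.create(file, true, bufferSize, replication, blockSize)) {
                out.writeBytes("data that will be split into blocks\n");
            }
            System.out.println("block size used: "
                + fs.getFileStatus(file).getBlockSize() + " bytes");
        }
    }
}
```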
Goals of HDFS:
Fault detection and recovery: Since HDFS includes a large amount of commodity hardware, component failure is frequent. Therefore, HDFS needs mechanisms for quick and automatic fault detection and recovery.
Huge datasets: HDFS should scale to hundreds of nodes per cluster in order to manage applications with huge datasets.
Hardware at data: A requested task can be performed efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput; a small sketch of how a client can discover where a file's blocks are stored follows.
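As a hedged illustration of the "hardware at data" idea, the sketch below asks the namenode which hosts hold each block of a file; a locality-aware scheduler could then place computation on or near those hosts instead of moving the data. The URI and path are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt"); // placeholder
            FileStatus status = fs.getFileStatus(file);

            // Ask the namenode which datanodes hold each block of the file.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                // Computation scheduled on one of these hosts avoids copying
                // the block across the network.
                System.out.println("offset " + block.getOffset()
                    + " -> hosts " + String.join(",", block.getHosts()));
            }
        }
    }
}
```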