How Companies Manage Big Data
Data has become one of the most important business assets, and a company without a proper data strategy is likely to crumble under obsolete data resources. Data has been around since long before the advent of computers and proper database management systems. However, as technology became the engine driving the world we live in, data became its fuel. It therefore became necessary to ensure efficient storage, management, and manipulation of large amounts of data.
What is Big Data?
Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
Types Of Big Data
Big Data can be found in three forms:
Structured- Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with this kind of data (where the format is known well in advance) and deriving value from it. However, we are now running into problems as such data grows to enormous sizes, with typical volumes in the range of multiple zettabytes.
Unstructured- Any data with an unknown form or structure is classified as unstructured data. In addition to its sheer size, unstructured data poses multiple challenges when it comes to processing it and deriving value from it. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, and so on. Organizations today have a wealth of data available to them but, unfortunately, often don't know how to derive value from it, since this data sits in its raw, unstructured form.
Semi-structured- Semi-structured data can contain elements of both forms. It looks structured, but it is not defined by a fixed schema such as a table definition in a relational DBMS. A typical example of semi-structured data is data represented in an XML file, as in the sketch below.
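To make the distinction concrete, here is a small, illustrative Python sketch (the sample XML document and the SQLite table are invented for this example): the XML input is semi-structured, while the table it is loaded into is structured, with a fixed, known schema.

```python
# Semi-structured in, structured out: parse XML, load it into a fixed-schema table.
import sqlite3
import xml.etree.ElementTree as ET

doc = """
<orders>
  <order id="1"><customer>Alice</customer><amount>120.50</amount></order>
  <order id="2"><customer>Bob</customer></order>
</orders>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

for order in ET.fromstring(doc).findall("order"):
    conn.execute(
        "INSERT INTO orders VALUES (?, ?, ?)",
        (
            int(order.get("id")),
            order.findtext("customer"),
            # Semi-structured data may simply omit fields; structured storage
            # has to decide what to do about that (NULL here).
            float(order.findtext("amount")) if order.findtext("amount") else None,
        ),
    )

for row in conn.execute("SELECT * FROM orders"):
    print(row)   # e.g. (1, 'Alice', 120.5) and (2, 'Bob', None)
```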
Challenges Faced with Big Data
Data volumes are continuing to grow, and so are the possibilities of what can be done with so much raw data available. However, organizations need to know just what they can do with that data and how much of it they can leverage to build insights for their consumers, products, and services. Of the 85% of companies using Big Data, only 37% have been successful at generating data-driven insights, while a 10% increase in the accessibility of data has been estimated to add around $65 million to a company's net income.
Some of the commonly faced issues include inadequate knowledge about the technologies involved, data privacy concerns, and inadequate analytical capabilities within organizations. Many enterprises also struggle with a shortage of skills for dealing with Big Data technologies: not many people are actually trained to work with Big Data, which then becomes an even bigger problem.
1. Handling a Large Amount of Data
There has been a huge explosion in the data available. Look back a few years and compare it with today, and you will see an exponential increase in the data that enterprises can access. They have data on everything: what a consumer likes, how they react to a particular scent, even the amazing restaurant that opened up in Italy last weekend.
This often exceeds the amount of data that can be stored, computed, and retrieved. The challenge is not so much the availability of data as its management. With estimates claiming that by 2020 the data generated would stretch 6.6 times the distance between the Earth and the Moon, this is definitely a challenge.
Along with the rise in unstructured data, there has also been a rise in the number of data formats: video, audio, social media, smart device data, and more.
Some of the newer ways of managing this data combine relational databases with NoSQL databases. An example of this is MongoDB, an inherent part of the MEAN stack. There are also distributed computing systems such as Hadoop that help manage Big Data volumes.
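As a rough illustration of the NoSQL side of such a hybrid, the sketch below stores heterogeneous records in MongoDB using the pymongo client. The connection string, database, and collection names are assumptions, and a running MongoDB instance would be needed.

```python
# Storing schema-less documents in MongoDB (illustrative names throughout).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["events"]

# Documents in the same collection do not need to share a schema.
events.insert_one({"type": "page_view", "url": "/home", "ms": 42})
events.insert_one({"type": "video_play", "video_id": "abc123", "duration_s": 37.5})

# Query by whatever fields a given document happens to have.
for doc in events.find({"type": "page_view"}):
    print(doc)
```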
Netflix, a content streaming platform whose services run largely on Node.js, needed a stack that could handle the storage and retrieval of its data as the load of content and the complexity of formats on the platform grew. It adopted the MEAN stack and, with its document-based (non-relational) database model, was able to manage that data.
2. Real-time can be Complex
When I say data, I’m not limiting this to the “stagnant” data available at common disposal. A lot of data updates every second, and organizations need to be aware of that too. For instance, if a retail company wants to analyze customer behavior, real-time data from current purchases can help. There are data analysis tools built to handle exactly this velocity and veracity; they come with ETL engines, visualization and computation engines, frameworks, and the other necessary components.
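As a toy illustration of this kind of near-real-time analysis, the sketch below keeps a rolling count of purchases per product over the last 60 seconds. The event stream is simulated; a production system would read from a streaming source such as Kafka instead.

```python
# Rolling, windowed counts over a stream of purchase events (simulated here).
import time
from collections import deque, Counter

WINDOW_S = 60
events = deque()          # (timestamp, product_id) pairs inside the window

def record_purchase(product_id, now=None):
    now = now or time.time()
    events.append((now, product_id))
    # Drop events that have fallen out of the 60-second window.
    while events and events[0][0] < now - WINDOW_S:
        events.popleft()

def current_counts():
    return Counter(product for _, product in events)

record_purchase("sku-1")
record_purchase("sku-2")
record_purchase("sku-1")
print(current_counts())   # Counter({'sku-1': 2, 'sku-2': 1})
```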
It is important for businesses to keep themselves updated with this data, along with the “stagnant” and always available data. This will help build better insights and enhance decision-making capabilities.
However, not all organizations are able to keep up with real-time data, as they have not kept pace with the evolving tools and technologies required. Currently, there are a few reliable tools, though many still lack the necessary sophistication.
3. Data Security
A lot of organizations report that they struggle with data security, and it turns out to be a bigger challenge for them than many other data-related problems. The data that comes into enterprises arrives from a wide range of sources, some of which cannot be trusted to be secure and compliant with organizational standards.
They need to use a variety of data collection strategies to keep up with data needs. This in turn leads to inconsistencies in the data, and then in the outcomes of the analysis. Even a simple figure such as annual turnover for the retail industry can come out differently when analyzed from different input sources; a business then needs to reconcile the differences and narrow them down to an answer that is valid and useful.
This data is made available from numerous sources and therefore carries potential security problems. You may never know which channel of data has been compromised, which puts the data held by the organization at risk and gives hackers a chance to move in.
It’s necessary to introduce Data Security best practices for secure data collection, storage and retrieval.
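One such practice, sketched below in Python, is encrypting records before they are persisted, here using the third-party cryptography package (pip install cryptography). Key management is deliberately left out of scope, and the record shown is invented.

```python
# Encrypt-before-store: a minimal sketch of one secure-storage practice.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # in practice, load this from a secrets manager
fernet = Fernet(key)

record = b'{"user": "alice", "card_last4": "4242"}'
token = fernet.encrypt(record)     # this ciphertext is what gets persisted
print(fernet.decrypt(token))       # the original bytes come back only with the key
```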
4. Shortage of Skilled People
There is a definite shortage of skilled Big Data professionals available at this time, a problem raised by many enterprises seeking to make better use of Big Data and build more effective data analysis systems. There is a lack of experienced people and certified data scientists or data analysts at present, which makes the “number crunching” difficult and insight building slow.
Training people at entry level, meanwhile, can be expensive for a company dealing with new technologies. Many are instead working on automation solutions involving Machine Learning and Artificial Intelligence to build insights, but this too requires well-trained staff or the outsourcing of skilled developers.
How Much Data Do Big Companies Like Facebook Store?
Facebook has revealed some big, big stats on big data to reporters at its HQ, including that its system processes 2.5 billion pieces of content and 500+ terabytes of data each day. It pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data each half hour.
Another stat Facebook revealed was that over 100 petabytes of data are stored in a single Hadoop disk cluster. The speed of ingestion keeps increasing, and “the world is getting hungrier and hungrier for data.” Right now Facebook actually stores its entire live, evolving user database in a single data center, with others used for redundancy and other data. When the main chunk gets too big for one data center, Facebook has to move the whole thing to another that has been expanded to fit it. This shuttling around is a waste of resources.
What Technologies Does Facebook Use to Manage Its Big Data?
There is a combined workforce of people and technology constantly working behind the successful implementation of this platform. Though the platform is continuously being enriched, below are the prime technological aspects:
1. Hadoop
“Facebook runs the world’s largest Hadoop cluster,” says Jay Parikh, Vice President Infrastructure Engineering, Facebook.
Facebook's Hadoop cluster spans more than 4,000 machines and stores hundreds of petabytes of data. This extensive cluster provides some key abilities to developers:
- The developers can freely write MapReduce programs in any language (a minimal word-count sketch in this style follows this list).
- SQL has been integrated to process extensive data sets, as most of the data in Hadoop's file system is in table format. Hence, the data becomes easily accessible to developers who know only a small subset of SQL.
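Here is a minimal word-count sketch in the MapReduce style that Hadoop Streaming accepts from any language: the mapper and reducer only read and write plain text lines, so Hadoop could run them as separate scripts. They are chained in-process below purely to show the data flow, and the input lines are made up for illustration.

```python
# Word count in the map -> sort -> reduce pattern used by Hadoop Streaming.
from itertools import groupby

def mapper(lines):
    """Emit one 'word TAB 1' line per word (what mapper.py would print to stdout)."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Sum the counts per word (what reducer.py would do after Hadoop's sort phase)."""
    keyed = (pair.split("\t") for pair in sorted_pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

sample = ["big data is big", "data about data"]
for word, total in reducer(sorted(mapper(sample))):
    print(word, total)   # e.g. about 1, big 2, data 3, is 1
```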
Hadoop provides a common infrastructure for Facebook with efficiency and reliability. From search, log processing, recommendation systems, and data warehousing to video and image analysis, Hadoop empowers this social networking platform in every way possible. Facebook also built its first user-facing application on the Hadoop database Apache HBase: Facebook Messenger, whose layered architecture supports the plethora of messages sent in a single day.
When To Use Hadoop:
- For Processing Really BIG Data: ...
- For Storing a Diverse Set of Data: ...
- For Parallel Data Processing: ...
When Not To Use Hadoop:
- For Real-Time Data Analysis: ...
- For a Relational Database System: ...
- For a General Network File System: ...
- For Non-Parallel Data Processing: ...
Underneath all of these workloads sits the Hadoop Distributed File System (HDFS), the storage layer that spreads the data across the machines in the cluster.
2. Scuba
With a huge amount of unstructured data coming in each day, Facebook realized that it needed a platform to speed up the analysis itself. That is when it developed Scuba, which helps Hadoop developers dive into massive data sets and carry out ad-hoc analyses in real time.
Facebook was not initially prepared to run across multiple data centers, and a single breakdown could cause the entire platform to crash. Scuba, another Big Data platform, allows developers to store bulk data in memory, which speeds up analysis. It uses small software agents that collect data from multiple data centers and compress it into a log data format. Scuba then loads this compressed log data into memory systems, where it is instantly accessible.
According to Jay Parikh, “Scuba gives us this very dynamic view into how our infrastructure is doing — how our servers are doing, how our network is doing, how the different software systems are interacting.”
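Scuba itself is proprietary to Facebook, but a toy Python sketch can illustrate the kind of ad-hoc, in-memory slice-and-dice it serves. The log records and field names below are invented.

```python
# Ad-hoc, in-memory aggregation over log records (toy data, toy query).
from collections import defaultdict

logs = [
    {"datacenter": "dc1", "service": "web", "latency_ms": 120},
    {"datacenter": "dc1", "service": "db",  "latency_ms": 45},
    {"datacenter": "dc2", "service": "web", "latency_ms": 210},
]

def avg_latency_by(field):
    """Group the in-memory records by any field and average their latency."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in logs:
        totals[row[field]] += row["latency_ms"]
        counts[row[field]] += 1
    return {key: totals[key] / counts[key] for key in totals}

print(avg_latency_by("datacenter"))   # {'dc1': 82.5, 'dc2': 210.0}
print(avg_latency_by("service"))      # slice the same data a different way
```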
3. Cassandra
“The amount of data to be stored, the rate of growth of the data, and the requirement to serve it within strict SLAs made it very apparent that a new storage solution was absolutely essential.”
- Avinash Lakshman, Search Team, Facebook
Traditional data storage started lagging behind when Facebook's search team ran into the Inbox Search problem: the developers were struggling to store the reverse indices of messages sent and received by users. The challenge was to develop a new storage solution that could solve the Inbox Search problem and similar problems in the future. That is when Prashant Malik and Avinash Lakshman started developing Cassandra.
The objective was to develop a distributed storage system dedicated to managing large amounts of structured data across multiple commodity servers, with no single point of failure.
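A hedged sketch of that kind of storage, using the open-source DataStax Python driver for Cassandra (pip install cassandra-driver), is shown below. The host, keyspace, and table are assumptions chosen to echo the inbox-search use case, and a running Cassandra cluster would be required.

```python
# Replicated, partitioned storage for a message index (illustrative schema).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS inbox
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS inbox.message_index (
        user_id text, term text, message_id timeuuid,
        PRIMARY KEY ((user_id, term), message_id)
    )
""")

# Index a message under a search term, then look the term up for that user.
session.execute(
    "INSERT INTO inbox.message_index (user_id, term, message_id) VALUES (%s, %s, now())",
    ("alice", "meeting"),
)
rows = session.execute(
    "SELECT message_id FROM inbox.message_index WHERE user_id=%s AND term=%s",
    ("alice", "meeting"),
)
for row in rows:
    print(row.message_id)
```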
4. Hive
After Yahoo implemented Hadoop for its search engine, Facebook looked at empowering its data scientists, who were having to handle ever larger amounts of data in the Oracle data warehouse. Hence, Hive came into existence. This tool improved the query capability of Hadoop by using a subset of SQL and soon gained popularity in the unstructured world. Today thousands of jobs are run on this system every day to process a range of applications quickly.
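To give a flavour of Hive's SQL subset, here is a hedged Python sketch that runs a HiveQL query through the third-party PyHive client (pip install pyhive). The host, port, and table name are assumptions, and a running HiveServer2 endpoint would be needed.

```python
# Querying Hadoop-resident data with plain SQL via Hive.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# Hive compiles this SQL into jobs over files stored in HDFS.
cursor.execute("""
    SELECT country, COUNT(*) AS signups
    FROM user_events
    WHERE event_type = 'signup'
    GROUP BY country
    ORDER BY signups DESC
    LIMIT 10
""")
for country, signups in cursor.fetchall():
    print(country, signups)
```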
5. Prism
Hadoop wasn’t designed to run across multiple facilities. Typically, because it requires such heavy communication between servers, clusters are limited to a single data center.
When Facebook first implemented Hadoop, this single-data-center limitation applied to it too, and that is when the team at Facebook felt the need to develop Prism. Prism is a platform that brings up many namespaces instead of the single one governed by Hadoop, which in turn makes it possible to create many logical clusters.
This makes the system expandable to as many servers as needed, without the number of data centers being a limiting concern.
6. Corona
Developed by ex-Yahoo engineer Avery Ching and his team, Corona allows multiple jobs to be processed at a time on a single Hadoop cluster without crashing the system. The idea for Corona sprouted when the developers started facing issues with Hadoop's original framework: it was getting tougher to manage the cluster resources and task trackers, MapReduce's pull-based scheduling model was delaying small jobs, and Hadoop's slot-based resource management model wasted slots whenever the cluster configuration did not fit the workload.
Developing and implementing Corona helped in forming a new scheduling framework that could separate the cluster resource management from job coordination.
7. Peregrine
Another tool, developed by Murthy, was Peregrine, which is dedicated to addressing the issue of querying data as quickly as possible. Since Hadoop was built as a batch system that took time to run its jobs, Peregrine brought the entire process much closer to real time.
Apart from the above prime implementations, Facebook uses many other small and large pieces of technology to support its Big Data infrastructure, such as Memcached, HipHop for PHP, Haystack, BigPipe, Scribe, Thrift, Varnish, etc.
Today Facebook is one of the biggest corporations on earth, thanks to its extensive data on over one and a half billion people. This has given it enough clout to negotiate with the more than 3 million advertisers on its platform and to clock staggering revenues north of 17 billion US dollars. But privacy and security concerns still loom large over whether Facebook will use those gargantuan volumes of data to serve humanity's greater good or just to make more money.
What is Distributed Storage?
A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.
Distributed storage is the basis for massively scalable cloud storage systems like Amazon S3 and Microsoft Azure Blob Storage, as well as on-premise distributed storage systems like Cloudian Hyperstore.
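As a small, hedged illustration of such object storage in practice, the sketch below writes and reads one object in Amazon S3 via boto3. The bucket name is an assumption, and AWS credentials are expected to be configured in the environment.

```python
# Writing and reading an object by key; S3 distributes and replicates it behind the scenes.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-analytics-bucket",                 # assumed bucket name
    Key="events/2024/01/01/events.json",
    Body=b'{"type": "page_view", "url": "/home"}',
)

obj = s3.get_object(
    Bucket="example-analytics-bucket",
    Key="events/2024/01/01/events.json",
)
print(obj["Body"].read())
```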
Distributed storage systems can store several types of data:
- Files—a distributed file system allows devices to mount a virtual drive, with the actual files distributed across several machines.
- Block storage—a block storage system stores data in volumes known as blocks. This is an alternative to a file-based structure that provides higher performance. A common distributed block storage system is a Storage Area Network (SAN).
- Objects—a distributed object storage system wraps data into objects, identified by a unique ID or hash.
Distributed storage systems have several advantages:
- Scalability—the primary motivation for distributing storage is to scale horizontally, adding more storage space by adding more storage nodes to the cluster.
- Redundancy—distributed storage systems can store more than one copy of the same data, for high availability, backup, and disaster recovery purposes.
- Cost—distributed storage makes it possible to use cheaper, commodity hardware to store large volumes of data at low cost.
- Performance—distributed storage can offer better performance than a single server in some scenarios, for example, it can store data closer to its consumers, or enable massively parallel access to large files.
Distributed Storage Features and Limitations
Most distributed storage systems have some or all of the following features:
- Partitioning—the ability to distribute data between cluster nodes and enable clients to seamlessly retrieve the data from multiple nodes (a toy sketch of this idea follows this list).
- Replication—the ability to replicate the same data item across multiple cluster nodes and maintain consistency of the data as clients update it.
- Fault tolerance—the ability to keep data available even when one or more nodes in the distributed storage cluster go down.
- Elastic scalability—enabling data users to receive more storage space if needed, and enabling storage system operators to scale the storage system up and down by adding or removing storage units to the cluster.
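As promised above, here is a toy sketch of the partitioning idea: choosing which storage node owns a key by hashing the key. Real systems typically use consistent hashing or range partitioning so that adding a node moves only a small fraction of keys; this modulo version, with made-up node names, is only the simplest possible illustration.

```python
# Hash-based partitioning: every key deterministically maps to one node.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def owner(key: str) -> str:
    """Return the node responsible for storing this key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for key in ["user:1", "user:2", "photo:99"]:
    print(key, "->", owner(key))
```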
Conclusion: This has been an overview of Big Data and how companies manage such vast amounts of data every day.
I would like to thank Mr. Vimal Daga sir for giving us the opportunity to learn these technologies.