Data at Scale: Business & Technology Mapping
Any business whose technology foundation is built on rigid, tightly coupled systems cannot scale at will as its online users or transactions grow exponentially. Most products built for large enterprises, B2B, B2C, and consumer service businesses (such as apps for cabs or food delivery) face scalability challenges as they grow their consumer base. Social media and audio/video streaming companies today have billions of active users, billions of likes, millions of photos uploaded to their sites (around 500 million a day), and millions of audio and video streams per day. Petabytes of data are generated daily from devices, user behavior, location, and specific user actions, and this is what brings up the challenge of scalability. Social media sites store everything from your stickers to your login location, with approximately 800 terabytes of data uploaded daily. Search engines offer an option to download all of the data they store about you; they know everything you have ever searched, and deleted. GPS and map services know where you have been. The data these companies hold on you could fill millions of Word documents.

It is all about scale: analytics on petabytes of data, real-time processing of ingested data joined with persisted data, and background processing over massive data stores to make meaningful business out of it. Most of the advanced technologies available are built to support replication, high availability, fault tolerance, and data sharding, and they have mechanisms to overcome single points of failure. However, the most important points to consider while evaluating a technology for scale are ease of maintenance, deployment, configuration, data performance, and the ability to expand as the business grows by simply adding new clusters and nodes. Vertical scalability involves moving from one machine to another that has more capacity, whether that is RAM, CPU, storage, or some combination. Horizontal scalability allows for seamless addition of nodes using commodity servers. Technologies that meet these requirements are the obvious choice for the engineering community and businesses.
Processing Data Using Parallel Jobs Running on Distributed Nodes
Yes, this is one of the major challenges with Big Data. To solve it, we move processing to the data rather than data to the processing. Apache Hadoop is an open-source software framework used to develop data processing applications that execute in a distributed environment. HDFS (Hadoop Distributed File System) provides a distributed way to store Big Data: data is split into blocks of a configured size and stored across the DataNodes. For example, if you have 512 MB of data and HDFS is configured with a 128 MB block size, the data is split into four 128 MB blocks that are distributed and replicated across nodes. Hadoop's second core component is YARN, which handles resource management and allows parallel processing of the data stored in HDFS using MapReduce jobs. Managing and administering Hadoop clusters is not simple, so several vendors package Hadoop with built-in software to manage these clusters. Several libraries and open-source projects built around Hadoop, including Hive, HBase, Pig, Oozie, and Sqoop, make life simpler.
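To make the "move processing to the data" idea concrete, below is a minimal sketch of the canonical MapReduce word count using the standard Hadoop Java API. The mappers run in parallel on the nodes holding the HDFS blocks, and the reducers aggregate their partial counts; the input and output paths are whatever you pass on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The mapper runs in parallel on each HDFS block, close to where the data lives
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);      // emit (word, 1) for every token
            }
        }
    }

    // The reducer aggregates the partial counts emitted by all mappers
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this as a JAR and submit it with something like `hadoop jar wordcount.jar WordCount /input /output`; YARN then schedules the map and reduce tasks across the cluster.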
Real-time Data Analytics
Data streaming is another wave in the analytics and machine learning landscape, as it helps organizations make quick decisions through real-time analytics. Real-time data is used in many fields today, including threat detection in networks and tracking the movement of trains in transit systems. Apache Storm is a must-have tool to consider among the options available today. Unlike Hadoop, which carries out batch processing, Apache Storm is built specifically for transforming streams of data, though it can also be used for online machine learning, ETL, and more. Its ability to process data at the nodes faster than its competitors is what differentiates Apache Storm. Apache Kafka is a distributed streaming platform for publishing and subscribing to streams of records, similar to a message queue or enterprise messaging system. It stores streams of records in a fault-tolerant way and processes them as they occur. Kafka runs as a cluster on one or more servers that can span multiple datacenters, and the cluster stores streams of records in categories called topics. It can achieve high throughput (millions of messages per second) with limited resources, whereas RabbitMQ would need close to 30 nodes to process a million messages per second. One social media company designed for professionals runs thousands of Kafka brokers organized into 50+ clusters. Web activity tracking, log aggregation, stream processing, and event sourcing are common use cases for Kafka.
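As a small illustration of the publish/subscribe model described above, here is a minimal sketch of a Kafka producer using the standard Java client; the broker address, the "web-activity" topic, and the JSON payload are made-up example values.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ActivityProducer {
    public static void main(String[] args) {
        // Minimal producer configuration; a production setup would tune acks, batching, retries, etc.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records are appended to the "web-activity" topic; the key controls partitioning
            producer.send(new ProducerRecord<>("web-activity", "user-42",
                    "{\"action\":\"page_view\",\"page\":\"/pricing\"}"));
        }
    }
}
```

Any number of consumer groups can read the same topic independently, which is why activity tracking, log aggregation, and stream processing can all share one ingestion pipeline.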
Horizontal and Vertical Scaling Using NoSQL Databases for Persistence
NoSQL databases such as Apache Cassandra (a wide-column store) meet the requirements of an ideal horizontally scalable system: they allow seamless addition of nodes using commodity servers and support replication, high availability, fault tolerance, partitioning of data across multi-datacenter clusters, and sharding. Being schemaless, or able to change the schema to store any form of information, makes NoSQL databases perfect for storing structured and unstructured data. Similarly, MongoDB is schemaless and uses JSON documents; unlike an RDBMS, these NoSQL databases are extremely flexible. For example, in Cassandra tables can be created, altered, and dropped while the database is running, and columns can be added on the fly with every write. Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, and Netflix's (2,500 nodes, 420 TB, over 1 trillion requests per day).
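The sketch below illustrates that schema flexibility with the DataStax Java driver: it creates a keyspace and a table and then adds a column while the cluster keeps serving traffic. The keyspace, table, and column names are invented for the example, and it assumes a local single-node cluster whose datacenter uses the stock name "datacenter1".

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class SchemaOnTheFly {
    public static void main(String[] args) {
        // With no explicit contact point the driver connects to localhost:9042
        try (CqlSession session = CqlSession.builder()
                .withLocalDatacenter("datacenter1")   // default datacenter name of a stock install
                .build()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 3}");

            // Tables can be created while the cluster keeps serving reads and writes
            session.execute("CREATE TABLE IF NOT EXISTS demo.user_activity ("
                    + "user_id uuid, event_time timestamp, action text, "
                    + "PRIMARY KEY (user_id, event_time))");

            // Columns can be added on the fly, with no downtime or table rebuild
            session.execute("ALTER TABLE demo.user_activity ADD device_type text");
        }
    }
}
```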
Data Insights and Visualization
Visualize and act on real-time data using technologies such as the ELK stack (Elasticsearch, Logstash, and Kibana). From finance and behavior analytics to usage monitoring and marketing content performance, you can get deeper insights into your data with the Elastic Stack, a scalable cross-platform analytics engine. Customize dashboards with line charts, maps, and more to identify high-performing regions, analyze sales funnels, and improve website analytics, all in real time. Elasticsearch is a search and analytics engine. Logstash is a server-side data processing pipeline that ingests data from multiple sources in parallel, transforms it, and then sends it to a "stash" like Elasticsearch. Using Kibana, users visualize the data in Elasticsearch with charts and graphs. Pushing all the log data to the cloud and analyzing it with the ELK stack is a very common use case in modern cloud-based applications.
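As a rough sketch of the "push logs and analyze" flow, the example below indexes a single log event into Elasticsearch over its REST API using Java's built-in HTTP client. The host, the "app-logs" index, and the log fields are assumptions; in a real ELK setup, Logstash or a Beats agent would normally do this shipping, and Kibana would visualize the resulting index.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LogIndexer {
    public static void main(String[] args) throws Exception {
        // A single structured log event; the fields here are invented for illustration
        String logEvent = "{\"timestamp\":\"2020-01-01T12:00:00Z\","
                + "\"level\":\"ERROR\",\"message\":\"payment gateway timeout\"}";

        // POST /app-logs/_doc indexes the document with an auto-generated id
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/app-logs/_doc"))   // assumed local node
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(logEvent))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // Elasticsearch's acknowledgement
    }
}
```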
Rise of Edge Computing
With the deployment of IoT devices and fast 5G wireless, moving compute and analytics closer to where data is created is making the case for edge computing. This is done so that data, especially real-time data, does not suffer latency issues that can affect an application's performance. Examples include devices that monitor manufacturing equipment on a factory floor, internet-connected video cameras that send live footage from a remote office, and autonomous cars that need to make a real-time decision to brake. Autonomous cars might communicate with a DataGrid for instant decision-making; similar to mobile tower technology, the cars communicate with the DataGrid while on the move. An edge gateway, for example, can process data from an edge device and then send only the relevant data back through the cloud, reducing bandwidth needs. From a security standpoint, data at the edge can be vulnerable, especially when it is being handled by different devices that may not be as secure as a centralized or cloud-based system. Applications using 5G technology will change traffic demand patterns, and its much higher network speed provides the biggest driver for edge computing in mobile cellular networks.
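A minimal sketch of the edge-gateway idea of filtering at the edge and forwarding only relevant data: the temperature threshold and the sensor readings are invented for illustration, and a real gateway would also buffer, batch, and secure its upstream traffic.

```java
import java.util.List;
import java.util.stream.Collectors;

public class EdgeGatewayFilter {
    // Assumed alert threshold for a temperature sensor, in degrees Celsius
    private static final double TEMP_THRESHOLD = 85.0;

    // Keep only the readings worth sending to the cloud, cutting bandwidth for normal traffic
    static List<Double> relevantReadings(List<Double> readings) {
        return readings.stream()
                .filter(t -> t > TEMP_THRESHOLD)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Double> raw = List.of(72.0, 90.5, 68.3, 101.2);              // processed locally at the edge
        System.out.println("Forward to cloud: " + relevantReadings(raw)); // [90.5, 101.2]
    }
}
```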
Rise of Quantum Computing
In classical computing, a computer runs on bits that have a value of either 0 or 1. Quantum bits, or "qubits," are similar in that, for practical purposes, we read them as a value of 0 or 1, but they can hold much richer information: a qubit can exist in a superposition of 0 and 1, described by amplitudes that can even be negative, with the squared amplitudes giving the probability of reading each value. Quantum computing will enable the "fifth generation" of computers. Google announced it has a quantum computer that is 100 million times faster than any classical computer in its lab, and another company recently announced its next-generation quantum computer with 2,000 qubits. Quantum computing can be applied to protein folding and drug discovery, portfolio risk optimization, and more. However, it also opens up a threat: well-known encryption algorithms used in SSL/TLS could be broken within minutes. The Indian government, in its 2020 budget, announced a National Mission on Quantum Technologies & Applications (NM-QTA) with a total budget outlay of Rs 8,000 crore over a period of five years, to be implemented by the Department of Science & Technology (DST).
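In the standard notation, a qubit's state is a weighted combination of the two classical values, where the weights (amplitudes) may be negative and their squares give the probability of each measurement outcome:

```latex
% A single qubit as a superposition of the basis states |0> and |1>
\lvert\psi\rangle = \alpha\,\lvert 0\rangle + \beta\,\lvert 1\rangle,
\qquad \lvert\alpha\rvert^{2} + \lvert\beta\rvert^{2} = 1
% Example: \alpha = -0.6, \beta = 0.8 reads as 0 with 36% probability and 1 with 64% probability
```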