Data at Scale: Business & Technology Mapping
Any business whose technology foundation is built on rigid, tightly coupled systems cannot scale at will as its online users or transactions grow exponentially. Most products built for large enterprises, B2B, B2C, and consumer service businesses (such as apps for cabs or food delivery) face scalability challenges as they grow their consumer base. Social media and audio/video streaming companies today have billions of active users, billions of likes, millions of photos uploaded to their sites (around 500 million a day), and millions of audio and video streams per day. Petabytes of data are generated daily from devices, user behavior, location, and specific user actions, and this is what brings up the challenge of scalability. Social media sites store everything from your stickers to your login location, with approximately 800 terabytes of data uploaded daily. Search engines offer an option to download all of the data they store about you; they know everything you have ever searched, and deleted. GPS and map services know where you have been. The data these companies hold on you could fill millions of Word documents.

It is all about scale: analytics on petabytes of data, real-time processing of ingested data joined with persisted data, and background processing over massive data stores to make meaningful business out of it. Most of the advanced technologies available are built to support replication, high availability, fault tolerance, and data sharding, and they have mechanisms to overcome single points of failure. However, the most important points to consider while evaluating a technology for scale are ease of maintenance, deployment, configuration, data performance, and the ability to expand as the business grows by simply adding new clusters and nodes. Vertical scalability involves moving from one machine to another that has more capacity, whether that is RAM, CPU, storage, or some combination. Horizontal scalability allows for seamless addition of nodes using commodity servers. Technologies that meet these requirements are the obvious choice for the engineering community and businesses.
Processing Data Using Parallel Jobs Running on Distributed Nodes
Yes, this is one of the major challenges with Big Data. To solve it, we move processing to the data rather than data to the processing. Apache Hadoop is an open-source software framework used to develop data processing applications that execute in a distributed environment. HDFS (Hadoop Distributed File System) provides a distributed way to store Big Data: data is split into blocks of a configured size and stored across the DataNodes. For example, if you have 512 MB of data and HDFS is configured with a 128 MB block size, the data is split into four 128 MB blocks that are distributed and replicated across nodes. Hadoop's second core component is YARN, which handles resource management and allows parallel processing of the data stored in HDFS using MapReduce jobs. Managing and administering Hadoop clusters is not simple, so several vendors package Hadoop with built-in software to manage these clusters. Several libraries and open-source projects built around Hadoop, including Hive, HBase, Pig, Oozie, and Sqoop, make life simpler.
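To make the "move processing to the data" idea concrete, below is a minimal sketch of the canonical MapReduce word count using the standard Hadoop Java API. The mappers run in parallel on the nodes holding the HDFS blocks, and the reducers aggregate their partial counts; the input and output paths are whatever you pass on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The mapper runs in parallel on each HDFS block, close to where the data lives
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);      // emit (word, 1) for every token
            }
        }
    }

    // The reducer aggregates the partial counts emitted by all mappers
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this as a JAR and submit it with something like `hadoop jar wordcount.jar WordCount /input /output`; YARN then schedules the map and reduce tasks across the cluster.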
Real-time Data Analytics
Data streaming is another wave in the analytics and machine learning landscape, as it helps organizations make quick decisions through real-time analytics. Real-time data is used in many fields today, including threat detection in networks and tracking the movement of trains in transit systems. Apache Storm is a must-have tool to consider among the options available today. Unlike Hadoop, which carries out batch processing, Apache Storm is built specifically for transforming streams of data, though it can also be used for online machine learning, ETL, and more. Its ability to process data at the nodes faster than its competitors is what differentiates Apache Storm. Apache Kafka is a distributed streaming platform for publishing and subscribing to streams of records, similar to a message queue or enterprise messaging system. It stores streams of records in a fault-tolerant way and processes them as they occur. Kafka runs as a cluster on one or more servers that can span multiple datacenters, and the cluster stores streams of records in categories called topics. It can achieve high throughput (millions of messages per second) with limited resources, whereas RabbitMQ would need close to 30 nodes to process a million messages per second. One social media company designed for professionals runs thousands of Kafka brokers organized into 50+ clusters. Web activity tracking, log aggregation, stream processing, and event sourcing are common use cases for Kafka.
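As a small illustration of the publish/subscribe model described above, here is a minimal sketch of a Kafka producer using the standard Java client; the broker address, the "web-activity" topic, and the JSON payload are made-up example values.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ActivityProducer {
    public static void main(String[] args) {
        // Minimal producer configuration; a production setup would tune acks, batching, retries, etc.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records are appended to the "web-activity" topic; the key controls partitioning
            producer.send(new ProducerRecord<>("web-activity", "user-42",
                    "{\"action\":\"page_view\",\"page\":\"/pricing\"}"));
        }
    }
}
```

Any number of consumer groups can read the same topic independently, which is why activity tracking, log aggregation, and stream processing can all share one ingestion pipeline.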
Horizontal and Vertical Scaling Using NoSQL Databases for Persistence
NoSQL databases such as Apache Cassandra (a wide-column store) meet the requirements of an ideal horizontally scalable system: they allow seamless addition of nodes using commodity servers and support replication, high availability, fault tolerance, partitioning of data across multi-datacenter clusters, and sharding. Being schemaless, or able to change the schema to store any form of information, makes NoSQL databases perfect for storing structured and unstructured data. Similarly, MongoDB is schemaless and uses JSON documents; unlike an RDBMS, these NoSQL databases are extremely flexible. For example, in Cassandra tables can be created, altered, and dropped while the database is running, and columns can be added on the fly with every write. Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, and Netflix's (2,500 nodes, 420 TB, over 1 trillion requests per day).
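The sketch below illustrates that schema flexibility with the DataStax Java driver: it creates a keyspace and a table and then adds a column while the cluster keeps serving traffic. The keyspace, table, and column names are invented for the example, and it assumes a local single-node cluster whose datacenter uses the stock name "datacenter1".

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class SchemaOnTheFly {
    public static void main(String[] args) {
        // With no explicit contact point the driver connects to localhost:9042
        try (CqlSession session = CqlSession.builder()
                .withLocalDatacenter("datacenter1")   // default datacenter name of a stock install
                .build()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 3}");

            // Tables can be created while the cluster keeps serving reads and writes
            session.execute("CREATE TABLE IF NOT EXISTS demo.user_activity ("
                    + "user_id uuid, event_time timestamp, action text, "
                    + "PRIMARY KEY (user_id, event_time))");

            // Columns can be added on the fly, with no downtime or table rebuild
            session.execute("ALTER TABLE demo.user_activity ADD device_type text");
        }
    }
}
```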
Data Insights and Visualization
Visualize and act on real-time data using technologies such as the ELK stack (Elasticsearch, Logstash, and Kibana). From finance and behavior analytics to usage monitoring and marketing content performance, you can get deeper insights into your data with the Elastic Stack, a scalable cross-platform analytics engine. Customize dashboards with line charts, maps, and more to identify high-performing regions, analyze sales funnels, and improve website analytics, all in real time. Elasticsearch is a search and analytics engine. Logstash is a server-side data processing pipeline that ingests data from multiple sources in parallel, transforms it, and then sends it to a "stash" like Elasticsearch. Using Kibana, users visualize the data in Elasticsearch with charts and graphs. Pushing all the log data to the cloud and analyzing it with the ELK stack is a very common use case in modern cloud-based applications.
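As a rough sketch of the "push logs and analyze" flow, the example below indexes a single log event into Elasticsearch over its REST API using Java's built-in HTTP client. The host, the "app-logs" index, and the log fields are assumptions; in a real ELK setup, Logstash or a Beats agent would normally do this shipping, and Kibana would visualize the resulting index.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LogIndexer {
    public static void main(String[] args) throws Exception {
        // A single structured log event; the fields here are invented for illustration
        String logEvent = "{\"timestamp\":\"2020-01-01T12:00:00Z\","
                + "\"level\":\"ERROR\",\"message\":\"payment gateway timeout\"}";

        // POST /app-logs/_doc indexes the document with an auto-generated id
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/app-logs/_doc"))   // assumed local node
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(logEvent))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // Elasticsearch's acknowledgement
    }
}
```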
Rise of Edge Computing
With the deployment of IoT devices and fast 5G wireless, moving compute and analytics closer to where data is created is making the case for edge computing. This is done so that data, especially real-time data, does not suffer latency issues that can affect an application's performance. Examples include devices that monitor manufacturing equipment on a factory floor, internet-connected video cameras that send live footage from a remote office, and autonomous cars that need to make a real-time decision to brake. Autonomous cars might communicate with a DataGrid for instant decision-making; similar to mobile tower technology, the cars communicate with the DataGrid while on the move. An edge gateway, for example, can process data from an edge device and then send only the relevant data back through the cloud, reducing bandwidth needs. From a security standpoint, data at the edge can be vulnerable, especially when it is being handled by different devices that may not be as secure as a centralized or cloud-based system. Applications using 5G technology will change traffic demand patterns, and its much higher network speed provides the biggest driver for edge computing in mobile cellular networks.
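A minimal sketch of the edge-gateway idea of filtering at the edge and forwarding only relevant data: the temperature threshold and the sensor readings are invented for illustration, and a real gateway would also buffer, batch, and secure its upstream traffic.

```java
import java.util.List;
import java.util.stream.Collectors;

public class EdgeGatewayFilter {
    // Assumed alert threshold for a temperature sensor, in degrees Celsius
    private static final double TEMP_THRESHOLD = 85.0;

    // Keep only the readings worth sending to the cloud, cutting bandwidth for normal traffic
    static List<Double> relevantReadings(List<Double> readings) {
        return readings.stream()
                .filter(t -> t > TEMP_THRESHOLD)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Double> raw = List.of(72.0, 90.5, 68.3, 101.2);              // processed locally at the edge
        System.out.println("Forward to cloud: " + relevantReadings(raw)); // [90.5, 101.2]
    }
}
```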
Rise of Quantum Computing
In classical computing, a computer runs on bits that have a value of either 0 or 1. Quantum bits, or "qubits," are similar in that, for practical purposes, we read them as a value of 0 or 1, but they can hold much richer information: a qubit can exist in a superposition of 0 and 1, described by amplitudes that can even be negative, with the squared amplitudes giving the probability of reading each value. Quantum computing will enable the "fifth generation" of computers. Google announced it has a quantum computer that is 100 million times faster than any classical computer in its lab, and another company recently announced its next-generation quantum computer with 2,000 qubits. Quantum computing can be applied to protein folding and drug discovery, portfolio risk optimization, and more. However, it also opens up a threat: well-known encryption algorithms used in SSL/TLS could be broken within minutes. The Indian government, in its 2020 budget, announced a National Mission on Quantum Technologies & Applications (NM-QTA) with a total budget outlay of Rs 8,000 crore over a period of five years, to be implemented by the Department of Science & Technology (DST).
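In the standard notation, a qubit's state is a weighted combination of the two classical values, where the weights (amplitudes) may be negative and their squares give the probability of each measurement outcome:

```latex
% A single qubit as a superposition of the basis states |0> and |1>
\lvert\psi\rangle = \alpha\,\lvert 0\rangle + \beta\,\lvert 1\rangle,
\qquad \lvert\alpha\rvert^{2} + \lvert\beta\rvert^{2} = 1
% Example: \alpha = -0.6, \beta = 0.8 reads as 0 with 36% probability and 1 with 64% probability
```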