Big Data and Spark difference between questionnaire: Part 4

This article continues the series; links to the earlier parts are below.

Part 1 link: Big Data and Spark difference between questionnaire: Part 1

Part 2 link: Big Data and Spark difference between questionnaire: Part 2

Part 3 link: Big Data and Spark difference between questionnaire: Part 3

Batch processing vs Streaming data processing vs Real-time data processing

Batch processing is when processing and analysis happen on a set of data that has already been stored over a period of time.

Batch processing is well suited to handling large volumes of data.

Examples: quarterly, monthly, or weekly data aggregations.

Streaming data processing happens as the data flows through a system, which results in analysis and reporting of events as they happen. Examples would be fraud detection or intrusion detection. Streaming data processing means that the data will be analyzed and acted on within a short period of time, or near real time, as best as the system can manage.

Real-time data processing guarantees that the data will be acted on within a bounded period of time, such as milliseconds. An example would be a real-time application that purchases a stock within 20 ms of receiving a desired price.
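To ground the distinction, here is a minimal PySpark sketch, assuming a hypothetical events/ folder of JSON records with event_date and amount columns (these names and the amount > 10000 rule are illustrative only, not from this article). The batch job reads whatever has already been stored; the streaming job keeps processing new records as they arrive, which is roughly how a near-real-time rule such as fraud detection would be wired up.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("batch-vs-streaming").getOrCreate()

  # Batch: process a bounded set of files that has already been stored.
  batch_df = spark.read.json("events/")                  # reads everything present now
  daily_totals = batch_df.groupBy("event_date").count()
  daily_totals.write.mode("overwrite").parquet("reports/daily_totals")

  # Streaming: keep processing records as new files arrive in the same folder.
  stream_df = spark.readStream.schema(batch_df.schema).json("events/")
  alerts = stream_df.filter(F.col("amount") > 10000)     # toy fraud-style rule
  query = alerts.writeStream.format("console").outputMode("append").start()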

Ref: https://www.confluent.io/learn/batch-vs-real-time-data-processing/

Data Lake vs Data Warehouse

Data lakes and data warehouses are both used in organizations to aggregate multiple sources of data, but they vary in their users and optimizations. Think of a data lake as the place where streams and rivers of data from various sources meet. All data is allowed, whether structured or unstructured, and no processing is done to the data until after it is in the data lake. It is highly attractive to data scientists and to applications leveraging the data for AI/ML, where new ways of using the data are possible.

A data warehouse is a centralized place for structured data to be analyzed for specific purposes related to business insights. The reporting requirements are known ahead of time, during the planning and design of the data warehouse and its ETL process. It is best suited for data sources that can be extracted using a batch process and for reports that deliver high value to the business.

Another way to think about it is that data lakes are schema-less and more flexible: they can store relational data from business applications as well as non-relational data such as server logs and social media feeds. Data warehouses, by contrast, rely on a schema and accept only relational data.

Data Warehouse vs Database

Data warehouses and databases both store structured data, but they were built for different scales and numbers of sources. A database thrives in a monolithic environment where the data is generated by a single application. A data warehouse is also relational, but it is built to support large volumes of data drawn from across all departments of an organization. Both support powerful query languages and reporting capabilities, and both are used primarily by the business members of an organization.

Partitioning vs Bucketing

Partitioning – Apache Hive organizes tables into partitions to group the same type of data together based on a column or partition key. Each table in Hive can have one or more partition keys to identify a particular partition. Using partitions, queries on slices of the data become faster, and the execution load is distributed horizontally. Partitioning is effective when each partition holds a modest volume of data, but queries such as GROUP BY over a high-volume partition still take a long time to execute. For example, grouping the population of China will take far longer than grouping the population of Vatican City.
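As an illustrative sketch (the population table, its columns, and the country partition key are hypothetical), a Hive-style partitioned table created through PySpark might look like this; filtering on the partition key lets the engine scan only the matching slice instead of the whole table.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  # Each distinct country value becomes its own directory (partition)
  # under the table location.
  spark.sql("""
      CREATE TABLE IF NOT EXISTS population (
          city   STRING,
          people BIGINT
      )
      PARTITIONED BY (country STRING)
      STORED AS PARQUET
  """)

  # Filtering on the partition key prunes all other partitions.
  spark.sql("SELECT SUM(people) FROM population WHERE country = 'Vatican City'").show()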

Bucketing – In Hive, tables or partitions are subdivided into buckets based on the hash of a column in the table, giving extra structure to the data that can be used for more efficient queries.

Like partitioning, it provides faster query responses.

Because bucketing spreads the data into roughly equal volumes per bucket, map-side joins are quicker.

We can define the number of buckets during table creation, but loading an equal volume of data into each bucket has to be ensured by the programmer.
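Continuing the hypothetical example above (same SparkSession; the page_views table, its columns, and the choice of 8 buckets are assumptions for illustration), a bucketed table can be declared in HiveQL with CLUSTERED BY, and the DataFrame writer offers bucketBy when saving as a table:

  # Buckets add a hash-based substructure inside each partition:
  # rows are assigned to one of 8 buckets by hashing user_id.
  spark.sql("""
      CREATE TABLE IF NOT EXISTS page_views (
          user_id BIGINT,
          url     STRING
      )
      PARTITIONED BY (view_date STRING)
      CLUSTERED BY (user_id) INTO 8 BUCKETS
      STORED AS ORC
  """)

  # DataFrame equivalent: bucketBy is only supported together with saveAsTable.
  views_df = spark.createDataFrame(
      [(1, "https://example.com/home", "2024-01-01")],
      ["user_id", "url", "view_date"])
  (views_df.write
      .partitionBy("view_date")
      .bucketBy(8, "user_id")
      .sortBy("user_id")
      .mode("overwrite")
      .saveAsTable("page_views_bucketed"))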

Ref: https://data-flair.training/blogs/hive-partitioning-vs-bucketing/

Managed tables vs External tables

Internal / Managed tables:

  • Hive moves the data file we load into the table to /database-name.db/table-name inside the warehouse directory
  • Internal tables support the TRUNCATE command
  • Internal tables also have ACID support (if the file format is ORC)
  • Internal tables also support query result caching, meaning Hive can store the result of an already executed query and reuse it for subsequent queries
  • Both the metadata and the table data are removed as soon as the table is dropped

External tables:

  • Hive does not move the data into the warehouse directory
  • External tables do not support the TRUNCATE command
  • No support for ACID transaction properties
  • No support for query result caching
  • Only the metadata is removed when an external table is dropped; the data files stay in place
  • Most production tables are external tables (see the sketch after this list)
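To make the distinction concrete, here is a minimal sketch reusing the SparkSession from the earlier examples; the table names, columns, and the /data/raw/orders location are hypothetical placeholders, not taken from this article. The syntactic difference is the EXTERNAL keyword plus a LOCATION clause, but it changes who owns the data files.

  # Managed table: Hive owns both the metadata and the data files under the
  # warehouse directory; dropping the table removes the data itself.
  spark.sql("""
      CREATE TABLE IF NOT EXISTS orders_managed (
          order_id BIGINT,
          amount   DOUBLE
      )
      STORED AS ORC
  """)

  # External table: Hive only records metadata pointing at files it does not
  # own; dropping the table removes the metadata and leaves the files at the
  # LOCATION intact.
  spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS orders_external (
          order_id BIGINT,
          amount   DOUBLE
      )
      STORED AS PARQUET
      LOCATION '/data/raw/orders'
  """)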

Thank you, to be continued (4/5).