Big Data and Spark "Difference Between" Questionnaire: Part 4
This article continues the series.
Batch processing vs Streaming data processing vs Real-time data processing
Batch processing means that processing and analysis happen on a set of data that has already been stored over a period of time.
Batch processing can handle large volumes of data.
Example: quarterly, monthly, or weekly data aggregation.
Streaming data processing happens as the data flows through a system, so events are analyzed and reported as they happen. Examples include fraud detection and intrusion detection. The data is analyzed and acted on within a short period of time, in near real-time, on a best-effort basis.
Real-time data processing guarantees that the data will be acted on within a fixed period of time, typically milliseconds. An example is a real-time application that purchases a stock within 20 ms of receiving a desired price.
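The difference between batch and streaming can be sketched in a few lines of plain Python. This is only an illustration, not how Spark or any other engine implements it: the `Event` shape, the function names, and the alert limit are all assumptions made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record shape for this sketch: (account_id, amount)
Event = Tuple[str, float]


def batch_totals(stored_events: List[Event]) -> Dict[str, float]:
    """Batch: the whole data set is already stored, so aggregate it in one pass."""
    totals: Dict[str, float] = defaultdict(float)
    for account, amount in stored_events:
        totals[account] += amount
    return dict(totals)


def stream_alerts(events: Iterable[Event], limit: float) -> Iterator[str]:
    """Streaming: look at each event as it arrives and react right away."""
    running: Dict[str, float] = defaultdict(float)
    for account, amount in events:
        running[account] += amount
        if running[account] > limit:
            # Emitted as soon as the limit is crossed, without waiting for the end
            yield f"alert: {account} exceeded {limit}"


events = [("a1", 40.0), ("a2", 10.0), ("a1", 70.0), ("a2", 5.0)]

# Batch view: one answer after all data has been collected
print(batch_totals(events))

# Streaming view: results appear while the events are still flowing
for alert in stream_alerts(iter(events), limit=100.0):
    print(alert)
```

The batch function cannot answer until it has seen everything, while the streaming generator raises the alert on the third event. A real-time system would additionally have to guarantee an upper bound on that reaction time, which this sketch does not attempt.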
Ref : https://www.confluent.io/learn/batch-vs-real-time-data-processing/
Data Lake vs Data Warehouse
Data lakes and data warehouses are both used in organizations to aggregate multiple sources of data, but they differ in their users and optimizations. Think of a data lake as the place where streams and rivers of data from various sources meet. All data is allowed, whether structured or unstructured, and no processing is done on the data until after it is in the data lake. This makes it highly attractive to data scientists and to AI/ML applications, where new ways of using the data are possible.
A data warehouse is a centralized place where structured data is analyzed for specific purposes related to business insights. The reporting requirements are known ahead of time, during the planning and design of the data warehouse and its ETL process. It is best suited for data sources that can be extracted using a batch process and for reports that deliver high value to the business.
Another way to think about it is that data lakes are schema-less and flexible enough to store relational data from business applications as well as non-relational data such as server logs and social media content, whereas data warehouses rely on a schema and accept only relational data.
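The schema difference described above can be shown with a small Python sketch. The two classes, the `WAREHOUSE_SCHEMA` columns, and the sample records are hypothetical and exist only for illustration; real data lakes and warehouses are storage systems, not in-memory lists.

```python
from typing import Any, Dict, List

# Hypothetical warehouse schema: column name -> required Python type
WAREHOUSE_SCHEMA: Dict[str, type] = {"order_id": int, "amount": float}


class DataLake:
    """Accepts any record as-is; structure is interpreted only when it is read."""

    def __init__(self) -> None:
        self.raw: List[Any] = []

    def write(self, record: Any) -> None:
        self.raw.append(record)


class DataWarehouse:
    """Accepts only records that match the schema defined up front."""

    def __init__(self, schema: Dict[str, type]) -> None:
        self.schema = schema
        self.rows: List[Dict[str, Any]] = []

    def write(self, record: Any) -> None:
        if not isinstance(record, dict) or set(record) != set(self.schema):
            raise ValueError("record does not match the warehouse columns")
        for column, expected_type in self.schema.items():
            if not isinstance(record[column], expected_type):
                raise ValueError(f"column {column!r} has the wrong type")
        self.rows.append(record)


lake = DataLake()
warehouse = DataWarehouse(WAREHOUSE_SCHEMA)
rejected = 0

records = [
    {"order_id": 1, "amount": 9.5},    # relational row from a business app
    "GET /index.html 200",             # raw server log line
    {"user": "x", "text": "hello"},    # social-media style document
]

for record in records:
    lake.write(record)                 # the lake takes everything
    try:
        warehouse.write(record)        # the warehouse validates on write
    except ValueError:
        rejected += 1

print(len(lake.raw), len(warehouse.rows), rejected)
```

All three records land in the lake, but only the first one fits the warehouse schema.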
Data Warehouse vs Database
Data warehouses and databases both store structured data, but they were built for different scales and numbers of sources. A database thrives in a monolithic environment where the data is generated by one application. A data warehouse is also relational, but it is built to support large volumes of data from across all departments of an organization. Both support powerful query languages and reporting capabilities, and both are used primarily by the business members of an organization.
Partitioning vs Bucketing
Partitioning – Apache Hive organizes a table into partitions, grouping rows of the same kind together based on the value of a column, the partition key. A Hive table can have one or more partition keys that identify a particular partition. Partitioning makes queries on slices of the data faster, because only the relevant partitions are read, and it distributes the execution load horizontally. Partitioning is effective when each partition holds a low volume of data. Some queries, such as a GROUP BY over a high volume of data, still take a long time to execute. For example, grouping the population of China will take much longer than grouping the population of Vatican City.
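A rough Python sketch of the idea follows. It mimics the `key=value` directory naming that Hive uses for partitions, but everything else is an assumption for illustration: the `population` table name, the helper functions, and the toy numbers, which are not real population figures.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical row shape: (country, city, population)
Row = Tuple[str, str, int]


def partition_path(table: str, country: str) -> str:
    """Hive-style partition directory name: <table>/<key>=<value>."""
    return f"{table}/country={country}"


def write_partitioned(table: str, rows: List[Row]) -> Dict[str, List[Row]]:
    """Group rows by the partition key, one 'directory' per key value."""
    storage: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        storage[partition_path(table, row[0])].append(row)
    return dict(storage)


def total_population(storage: Dict[str, List[Row]], table: str, country: str) -> int:
    """Read only the matching partition instead of scanning the whole table."""
    rows = storage.get(partition_path(table, country), [])
    return sum(population for _, _, population in rows)


# Toy numbers, not real population data
rows = [
    ("CN", "Beijing", 300),
    ("CN", "Shanghai", 400),
    ("VA", "Vatican City", 1),
]

storage = write_partitioned("population", rows)
print(sorted(storage))
print(total_population(storage, "population", "CN"))
```

A query for one country touches a single partition. The sketch also shows the limitation from the paragraph above: the `CN` partition still has to be scanned in full, however large it grows.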
Bucketing – In Hive, tables or partitions are subdivided into buckets based on a hash function applied to a column of the table. This gives the data extra structure that can be used for more efficient queries.
Like partitioning, it provides faster query responses.
Because each bucket holds a roughly equal volume of data, map-side joins are quicker.
The number of buckets is defined during table creation, but loading an equal volume of data into each bucket has to be done manually by programmers.
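The bucketing idea can be sketched in Python as "hash of the column value, modulo the number of buckets". Note the assumptions: `zlib.crc32` is only a stand-in for a stable hash and is not the hash function Hive actually uses, and the function names and sample rows are made up for this example.

```python
import zlib
from typing import Dict, List, Tuple

# Hypothetical row shape: (join_key, value)
Row = Tuple[str, str]


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a key to a bucket index. crc32 is a stand-in, not Hive's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows: List[Row], num_buckets: int) -> List[List[Row]]:
    """Distribute rows into a fixed number of buckets by hashing the key."""
    buckets: List[List[Row]] = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_for(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left: List[Row], right: List[Row], num_buckets: int) -> List[Tuple[str, str, str]]:
    """Join bucket i of one table only against bucket i of the other."""
    joined: List[Tuple[str, str, str]] = []
    left_buckets = bucketize(left, num_buckets)
    right_buckets = bucketize(right, num_buckets)
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        # Build a small lookup per bucket instead of one over the whole table
        lookup: Dict[str, List[str]] = {}
        for key, value in right_bucket:
            lookup.setdefault(key, []).append(value)
        for key, value in left_bucket:
            for right_value in lookup.get(key, []):
                joined.append((key, value, right_value))
    return joined


users = [("u1", "Ann"), ("u2", "Bo"), ("u3", "Cy")]
orders = [("u1", "book"), ("u3", "pen"), ("u1", "ink")]

for row in sorted(bucketed_join(users, orders, num_buckets=4)):
    print(row)
```

Since both tables are bucketed on the same key with the same number of buckets, matching rows always end up in the same bucket index, so each pair of buckets can be joined independently. That is the intuition behind the faster map-side joins mentioned above.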
Ref : https://data-flair.training/blogs/hive-partitioning-vs-bucketing/
Managed tables vs External tables
Internal / Managed tables:
External tables:
Thank you. To be continued (4/5).