Big Data and Spark difference between questionnaire: Part 4

This article continues the series; links to the earlier parts are below.

Part 1 link: Big Data and Spark difference between questionnaire: Part 1

Part 2 link: Big Data and Spark difference between questionnaire: Part 2

Part 3 link: Big Data and Spark difference between questionnaire: Part 3

Batch processing vs Streaming data processing vs Real-time data processing

Batch processing is when processing and analysis happen on a set of data that has already been stored over a period of time.

Batch processing is well suited to handling large volumes of data.

Examples: quarterly, monthly, or weekly data aggregations.

Streaming data processing happens as the data flows through a system, which results in analysis and reporting of events as they happen. Examples would be fraud detection or intrusion detection. Streaming data processing means that the data will be analyzed and acted on within a short period of time, or near real time, as best as the system can manage.

Real-time data processing guarantees that the data will be acted on within a bounded period of time, such as milliseconds. An example would be a real-time application that purchases a stock within 20 ms of receiving a desired price.
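To ground the distinction, here is a minimal PySpark sketch, assuming a hypothetical events/ folder of JSON records with event_date and amount columns (these names and the amount > 10000 rule are illustrative only, not from this article). The batch job reads whatever has already been stored; the streaming job keeps processing new records as they arrive, which is roughly how a near-real-time rule such as fraud detection would be wired up.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("batch-vs-streaming").getOrCreate()

  # Batch: process a bounded set of files that has already been stored.
  batch_df = spark.read.json("events/")                  # reads everything present now
  daily_totals = batch_df.groupBy("event_date").count()
  daily_totals.write.mode("overwrite").parquet("reports/daily_totals")

  # Streaming: keep processing records as new files arrive in the same folder.
  stream_df = spark.readStream.schema(batch_df.schema).json("events/")
  alerts = stream_df.filter(F.col("amount") > 10000)     # toy fraud-style rule
  query = alerts.writeStream.format("console").outputMode("append").start()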

Ref: https://www.confluent.io/learn/batch-vs-real-time-data-processing/

Data Lake vs Data Warehouse

Data lakes and data warehouses are both used in organizations to aggregate multiple sources of data, but they vary in their users and optimizations. Think of a data lake as the place where streams and rivers of data from various sources meet. All data is allowed, whether structured or unstructured, and no processing is done to the data until after it is in the data lake. It is highly attractive to data scientists and to applications leveraging the data for AI/ML, where new ways of using the data are possible.

A data warehouse is a centralized place for structured data to be analyzed for specific purposes related to business insights. The reporting requirements are known ahead of time, during the planning and design of the data warehouse and its ETL process. It is best suited for data sources that can be extracted using a batch process and for reports that deliver high value to the business.

Another way to think about it is that data lakes are schema-less and more flexible: they can store relational data from business applications as well as non-relational data such as server logs and social media feeds. Data warehouses, by contrast, rely on a schema and accept only relational data.

Data Warehouse vs Database

Data warehouses and databases both store structured data, but they were built for different scales and numbers of sources. A database thrives in a monolithic environment where the data is generated by a single application. A data warehouse is also relational, but it is built to support large volumes of data drawn from across all departments of an organization. Both support powerful query languages and reporting capabilities, and both are used primarily by the business members of an organization.

Partitioning vs Bucketing

Partitioning – Apache Hive organizes tables into partitions to group the same type of data together based on a column or partition key. Each table in Hive can have one or more partition keys to identify a particular partition. Using partitions, queries on slices of the data become faster, and the execution load is distributed horizontally. Partitioning is effective when each partition holds a modest volume of data, but queries such as GROUP BY over a high-volume partition still take a long time to execute. For example, grouping the population of China will take far longer than grouping the population of Vatican City.
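As an illustrative sketch (the population table, its columns, and the country partition key are hypothetical), a Hive-style partitioned table created through PySpark might look like this; filtering on the partition key lets the engine scan only the matching slice instead of the whole table.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  # Each distinct country value becomes its own directory (partition)
  # under the table location.
  spark.sql("""
      CREATE TABLE IF NOT EXISTS population (
          city   STRING,
          people BIGINT
      )
      PARTITIONED BY (country STRING)
      STORED AS PARQUET
  """)

  # Filtering on the partition key prunes all other partitions.
  spark.sql("SELECT SUM(people) FROM population WHERE country = 'Vatican City'").show()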

Bucketing – In Hive, tables or partitions are subdivided into buckets based on the hash of a column in the table, giving extra structure to the data that can be used for more efficient queries.

Like partitioning, it provides faster query responses.

Because bucketing spreads the data into roughly equal volumes per bucket, map-side joins are quicker.

We can define the number of buckets during table creation, but loading an equal volume of data into each bucket has to be ensured by the programmer.
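Continuing the hypothetical example above (same SparkSession; the page_views table, its columns, and the choice of 8 buckets are assumptions for illustration), a bucketed table can be declared in HiveQL with CLUSTERED BY, and the DataFrame writer offers bucketBy when saving as a table:

  # Buckets add a hash-based substructure inside each partition:
  # rows are assigned to one of 8 buckets by hashing user_id.
  spark.sql("""
      CREATE TABLE IF NOT EXISTS page_views (
          user_id BIGINT,
          url     STRING
      )
      PARTITIONED BY (view_date STRING)
      CLUSTERED BY (user_id) INTO 8 BUCKETS
      STORED AS ORC
  """)

  # DataFrame equivalent: bucketBy is only supported together with saveAsTable.
  views_df = spark.createDataFrame(
      [(1, "https://example.com/home", "2024-01-01")],
      ["user_id", "url", "view_date"])
  (views_df.write
      .partitionBy("view_date")
      .bucketBy(8, "user_id")
      .sortBy("user_id")
      .mode("overwrite")
      .saveAsTable("page_views_bucketed"))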

Ref: https://data-flair.training/blogs/hive-partitioning-vs-bucketing/

Managed tables vs External tables

Internal / Managed tables:

  • Hive moves the data file we load into the table to /database-name.db/table-name inside the warehouse directory
  • Internal tables support the TRUNCATE command
  • Internal tables also have ACID support (if the file format is ORC)
  • Internal tables also support query result caching, meaning Hive can store the result of an already executed query and reuse it for subsequent queries
  • Both the metadata and the table data are removed as soon as the table is dropped

External tables:

  • Hive does not move the data into the warehouse directory
  • External tables do not support the TRUNCATE command
  • No support for ACID transaction properties
  • No support for query result caching
  • Only the metadata is removed when an external table is dropped; the data files stay in place
  • Most production tables are external tables (see the sketch after this list)
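To make the distinction concrete, here is a minimal sketch reusing the SparkSession from the earlier examples; the table names, columns, and the /data/raw/orders location are hypothetical placeholders, not taken from this article. The syntactic difference is the EXTERNAL keyword plus a LOCATION clause, but it changes who owns the data files.

  # Managed table: Hive owns both the metadata and the data files under the
  # warehouse directory; dropping the table removes the data itself.
  spark.sql("""
      CREATE TABLE IF NOT EXISTS orders_managed (
          order_id BIGINT,
          amount   DOUBLE
      )
      STORED AS ORC
  """)

  # External table: Hive only records metadata pointing at files it does not
  # own; dropping the table removes the metadata and leaves the files at the
  # LOCATION intact.
  spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS orders_external (
          order_id BIGINT,
          amount   DOUBLE
      )
      STORED AS PARQUET
      LOCATION '/data/raw/orders'
  """)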

Thank you, to be continued (4/5).