Transaction Processing or Analytics?
Transaction processing systems
In the early days of business data processing, a write to the database typically corresponded to a commercial transaction taking place: making a sale, placing an order with a supplier, paying an employee’s salary, etc.
An application typically looks up a small number of records by some key, using an index. Records are inserted or updated based on the user's input. Because these applications are interactive, the access pattern became known as online transaction processing (OLTP).
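As a minimal sketch of that access pattern (using Python's built-in sqlite3 module and a hypothetical users table, not the shopping application's real schema):

import sqlite3

# Minimal OLTP sketch: a hypothetical `users` table with a primary-key index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")
conn.execute("INSERT INTO users VALUES (42, 'Alice', 100.0)")

# Look up a small number of records by key, using the index.
row = conn.execute("SELECT name, balance FROM users WHERE id = ?", (42,)).fetchone()

# Insert or update a record based on the user's input (e.g. paying a salary).
conn.execute("UPDATE users SET balance = balance + ? WHERE id = ?", (500.0, 42))
conn.commit()
print(row)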
Analytics systems
Analytics systems have a different access pattern than transaction processing systems. Usually an analytic query needs to scan over a huge number of records, reading only a few columns per record, and calculate aggregate statistics (such as count, sum, or average) rather than returning the raw data to the user.
These queries are often written by business analysts, and feed into reports that help the management of a company make better decisions (business intelligence).
Comparing characteristics
For example, consider a shopping application.
To make the concepts clear, we will walk through the sales module of the application's relational model.
The price and discount price fields are attributes of the sale itself; the other fields are foreign key references to other tables, called dimension tables. Each row in the sales table represents an event, and the dimensions answer the what, how, and when of that sale.
When the table relationships are visualized, the sales table sits in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star, hence the name “star schema”. When the dimension tables are broken down further into subdimensions, such as a product images table or a store images table, the variation is called a snowflake schema.
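As a rough sketch of the idea (the table names, keys, and values below are invented for illustration, not the application's actual schema), the fact table and its dimension tables could look like this:

# Rough star-schema sketch as plain Python data (names and values are invented).
# Dimension tables describe the what / where / when of each sale.
products   = {35: "iPhone 15", 68: "MacBook Air"}
stores     = {7: "Downtown", 9: "Airport"}
sales_date = {101: ("2024-05-03", "Friday")}
campaigns  = {3: "Summer Sale"}

# Fact table: one row per sale event. price and discount_price are attributes of
# the sale itself; the *_id fields are foreign keys into the dimension tables.
sales = [
    {"date_id": 101, "product_id": 35, "store_id": 7, "campaign_id": 3,
     "quantity": 2, "price": 999.0, "discount_price": 899.0},
]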
The sales table captures each sale (event) for later analysis. However, this means the sales table grows extremely large; large online stores like Apple, Walmart, and Flipkart generate petabytes of such data.
Let's start our analysis: suppose we want to find out whether customers are more inclined to buy Apple products depending on the day of the week.
MySQL query used
SELECT
    sd.weekday,
    p.name AS product_name,
    SUM(s.quantity) AS quantity_sold,
    GROUP_CONCAT(st.name) AS stores,
    GROUP_CONCAT(cam.name) AS campaigns
FROM sales s
INNER JOIN sales_date sd ON sd.id = s.date_id
INNER JOIN products p ON p.id = s.product_id
INNER JOIN stores st ON st.id = s.store_id
INNER JOIN campaigns cam ON cam.id = s.campaign_id
WHERE s.product_id = 35
GROUP BY sd.weekday, p.name;
Things happening behind the scenes
- Relational systems will load all of those rows (each consisting of over 100 attributes in production) from disk into memory
- Parse them
- Filter out those that don’t meet the required conditions.
Note: this can take a long time, because in production we are querying petabytes of data (see the toy sketch below).
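Here is a toy sketch of what that row-oriented scan looks like: every row, with all of its attributes, has to be loaded and parsed before the filter can even be applied (the attribute names and data are made up):

# Toy sketch of a row-oriented scan (attribute names and data are made up).
def make_row(i):
    row = {"product_id": 35 if i % 7 == 0 else i % 100, "quantity": 1}
    row.update({f"attr_{k}": k for k in range(100)})   # ~100 other attributes
    return row

rows = (make_row(i) for i in range(100_000))   # stand-in for reading rows from disk

total_quantity = 0
for row in rows:                      # every full row is loaded and parsed...
    if row["product_id"] == 35:       # ...just to apply this single filter
        total_quantity += row["quantity"]
print(total_quantity)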
Birth of analytics systems
To solve the problems we face when doing analysis on transactional systems, analytics systems with columnar storage were built specifically for analysing data. The same sales data we saw in the transactional system is stored in an analytics system as shown below.
Column storage layout
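A minimal sketch of the layout (the values are invented): each column is stored in its own array or file, and the kth entries of all the columns together make up the kth row:

# Column-oriented layout: one array per column (the values are invented).
# The kth entry of every array belongs to the same sale (row k).
columns = {
    "date_id":     [101, 101, 102, 102, 103],
    "product_id":  [35,  68,  35,  35,  69],
    "store_id":    [7,   7,   9,   7,   9],
    "campaign_id": [3,   3,   3,   1,   1],
    "quantity":    [2,   1,   1,   3,   1],
}

# A query that only needs product_id and quantity touches just those two arrays,
# instead of loading every attribute of every row.
sold = sum(q for p, q in zip(columns["product_id"], columns["quantity"]) if p == 35)
print(sold)   # -> 6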
Now, you may wonder how this helps us do analytics on petabytes of data.
Columnar compression
Here is the magic: since the data stored in a column is often quite repetitive, columnar storage engines compress it, using different techniques depending on the data. One technique that is particularly effective is bitmap encoding.
Example: after bitmap encoding, the product_id column is stored as shown below.
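Here is a small sketch of bitmap encoding (the column values are invented): for each distinct product_id we keep one bitmap, with one bit per row, set to 1 wherever that row contains this product_id:

# Bitmap encoding sketch: one bitmap per distinct value (the values are invented).
product_id_column = [35, 68, 35, 35, 69, 68, 35]

bitmaps = {}
for row_number, value in enumerate(product_id_column):
    # Bit i of the bitmap for `value` is 1 if row i contains that product_id.
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_number)

for value, bits in sorted(bitmaps.items()):
    # Print bit 0 (row 0) first, so each line reads left to right as rows 0..6.
    print(value, format(bits, "07b")[::-1])
# 35 1011001   (rows 0, 2, 3 and 6)
# 68 0100010   (rows 1 and 5)
# 69 0000100   (row 4)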
Bitmap indexes such as these are very well suited to the kinds of queries that are common in a data warehouse. Let's execute the same query we wrote earlier.
Load the bitmap for product_id = 35 and combine it with the bitmaps for any other predicates using bitwise AND/OR, which can be done very efficiently.
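A tiny sketch of that step, continuing the invented bitmaps above (the Friday bitmap is hypothetical, just to show a second predicate):

# Sketch: combining predicate bitmaps with bitwise operators (values invented).
# Bit i of each bitmap is 1 if row i satisfies that predicate.
product_35_rows = 0b1001101   # rows 0, 2, 3 and 6 have product_id == 35
friday_rows     = 0b0001011   # rows 0, 1 and 3 were sold on a Friday (hypothetical)

matching = product_35_rows & friday_rows              # bitwise AND: both predicates
print([row for row in range(7) if (matching >> row) & 1])   # -> [0, 3]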
Vectorized processing
We know that queries in analytics systems need to scan over millions of records. Developers of analytical databases also worry about efficiently using the bandwidth from main memory into the CPU cache, avoiding branch mispredictions and bubbles in the CPU instruction processing pipeline, and making use of single-instruction-multiple-data (SIMD) instructions in modern CPUs.
Things happening behind the scenes
- The query engine can take a chunk of compressed column data that fits comfortably in the CPU’s L1 cache
- Iterate through it in a tight loop
Note: since the columns are compressed, more of the data fits in the L1 cache, so more records can be processed per pass.
- Operators, such as the bitwise AND and OR described previously, can be designed to operate on such chunks of compressed column data directly. This technique is known as vectorized processing (a rough sketch follows below).
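A rough sketch of the idea using NumPy, whose array operations are themselves tight loops over contiguous memory that compilers can auto-vectorize with SIMD (the data and chunk size are invented; real engines operate on compressed chunks directly):

import numpy as np

# Rough sketch of vectorized, chunk-at-a-time processing (the data is invented).
product_id = np.random.choice([35, 68, 69], size=10_000_000).astype(np.int32)
quantity   = np.random.randint(1, 5, size=10_000_000).astype(np.int32)

CHUNK = 16_384            # small enough to sit comfortably in the CPU cache
total = 0
for start in range(0, len(product_id), CHUNK):
    p = product_id[start:start + CHUNK]
    q = quantity[start:start + CHUNK]
    # One vectorized comparison and one vectorized sum per chunk, instead of an
    # interpreted per-row loop with a branch for every single record.
    total += int(q[p == 35].sum())
print(total)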
Sorting column storage data
- Sorting each column independently makes no sense, because we know that the kth item in one column belongs to the same row as the kth item in another column; the rows have to be sorted as a whole.
- The database administrator can configure which column(s) the data is sorted by.
For example, if queries often target date ranges, such as the last month, it might make sense to make date_id the first sort key in the sales table.
- Data needs to be replicated to multiple machines anyway, so that you don't lose data if one machine fails. You might as well store that redundant data sorted in several different ways, so that when you're processing a query you can use the version that best fits the query pattern.
Example: you can even keep the differently sorted copies on the same machine under different names :D
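A small sketch of sorting column data by a chosen sort key (here date_id first, then product_id; the data is invented). All the columns have to be reordered together:

# Sketch: sorting column data by a chosen sort key (the values are invented).
# All columns must be reordered together, so row k stays row k in every column.
columns = {
    "date_id":    [103, 101, 102, 101],
    "product_id": [69,  35,  35,  68],
    "quantity":   [1,   2,   1,   1],
}

# Sort by (date_id, product_id): compute the row order once, apply it everywhere.
order = sorted(range(4), key=lambda k: (columns["date_id"][k], columns["product_id"][k]))
sorted_columns = {name: [values[k] for k in order] for name, values in columns.items()}

print(sorted_columns["date_id"])   # -> [101, 101, 102, 103]
# A second replica could keep the same data sorted by product_id instead, and the
# query planner would pick whichever sort order best matches the query.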
Writes in column storage
These optimizations make sense in data warehouses, because most of the load consists of large read-only queries run by analysts. Column-oriented storage, compression, and sorting all help to make those read queries faster. However, they have the downside of making writes more difficult: an update-in-place is not possible with compressed columns, and inserting a row in the middle of a sorted table would most likely mean rewriting all of the column files.
Materialized views
Data warehouse queries often involve an aggregate function, such as COUNT, SUM, AVG, MIN, or MAX in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time; caching such frequently used query results is exactly what a materialized view is.
In some data warehouses, relational databases are also used for analytics. There we can create virtual views: table-like objects whose contents are the results of some query. A virtual view is only a shortcut for writing the query, so it won't be as efficient as column storage once the data grows, and it can be optimized further with materialized data cubes that hold pre-computed aggregate values.
Note: updates and writes become more expensive in this model, because the materialized views and data cubes also have to be updated whenever the underlying data changes.
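A toy sketch of the idea behind a materialized view or data cube: pre-compute an aggregate once, keep it next to the raw data, and pay the extra cost of updating it on every write (the names and keys below are made up):

from collections import defaultdict

# Toy sketch of a materialized aggregate, a tiny "data cube" (names are made up).
# Cell (weekday, product_id) holds the pre-computed SUM(quantity).
sales_rows = []                                    # the raw fact data
quantity_by_weekday_product = defaultdict(int)     # the materialized aggregate

def insert_sale(weekday, product_id, quantity):
    # Every write now has to update the view as well, which is why writes become
    # more expensive once materialized views or data cubes exist.
    sales_rows.append((weekday, product_id, quantity))
    quantity_by_weekday_product[(weekday, product_id)] += quantity

insert_sale("Friday", 35, 2)
insert_sale("Friday", 35, 1)
insert_sale("Monday", 68, 1)

# Reads that would otherwise scan all raw rows become a single lookup.
print(quantity_by_weekday_product[("Friday", 35)])   # -> 3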
Hope you have learnt something new in this article. Please like, share, and comment so it can help others too.
Happy learning!