Transaction Processing or Analytics?

Transaction processing systems 

In the early days of business data processing, a write to the database typically corresponded to a commercial transaction taking place: making a sale, placing an order with a supplier, paying an employee’s salary, etc. 

An application typically looks up a small number of records by some key, using an index. Records are inserted or updated based on the user's input. Because these applications are interactive, the access pattern became known as online transaction processing (OLTP).
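As a tiny sketch of that access pattern (the table, keys, and data here are made up for illustration, not from any real system):

# Toy sketch of an OLTP access: look up a few records by key via an
# index, then apply a small update based on user input.
users_by_id = {42: {"name": "Alice", "balance": 100}}  # toy "index": key -> record

def place_order(user_id, amount):
    user = users_by_id[user_id]   # index lookup by key
    user["balance"] -= amount     # small, targeted write
    return user

place_order(42, 30)   # one interactive transaction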

Analytics systems 

Analytics systems have a different access pattern from transaction processing systems. Usually an analytic query needs to scan over a huge number of records, reading only a few columns per record, and calculate aggregate statistics (such as count, sum, or average) rather than returning the raw data to the user.

These queries are often written by business analysts, and feed into reports that help the management of a company make better decisions (business intelligence). 

Comparing characteristics 

[Image: table comparing the characteristics of transaction processing and analytics systems]

For example, consider a shopping application. To make the concepts concrete, we will walk through the sales module of the application's relational model:

[Image: relational model of the sales module, with the sales table surrounded by its dimension tables]

The fields price and discount price are attributes of the sale itself; the other fields are foreign key references to other tables, called dimension tables. Each row represents an event, answering the what, how, and when of a sale that happened.

When the table relationships are visualized, the sales table sits in the middle, surrounded by its dimension tables; the connections to these tables look like the rays of a star, hence the name "star schema". A variation in which dimension tables are broken down further into subdimensions (such as a product images table or a store images table) is called a snowflake schema.
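Here is a rough Python sketch of that star schema, reusing the table and column names from the query further below; the exact production schema is an assumption:

# Rough sketch of the star schema: a fact table whose rows reference
# dimension tables by foreign key, plus a few attributes of its own.
products   = {35: "iPhone", 12: "MacBook"}       # dimension: products
stores     = {3: "Chennai", 4: "Bangalore"}      # dimension: stores
campaigns  = {7: "Diwali Sale"}                  # dimension: campaigns
sales_date = {1: {"date": "2021-01-04", "weekday": "Monday"}}  # dimension: dates

# fact table: one row per sale event, mostly foreign keys
sales = [
    {"date_id": 1, "product_id": 35, "store_id": 3, "campaign_id": 7,
     "quantity": 2, "price": 999.0, "discount_price": 899.0},
]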

The sales table captures each sale (event) for later analysis. This means the sales table will grow extremely large; we know that online stores like Apple, Walmart, and Flipkart generate petabytes of such data.

Let's start our analysis: suppose we want to know whether customers are more inclined to buy Apple products depending on the day of the week.

MySQL query used

SELECT
  sd.weekday,
  p.name AS product_name,
  SUM(s.quantity) AS quantity_sold,
  GROUP_CONCAT(st.name) AS stores,
  GROUP_CONCAT(cam.name) AS campaigns
FROM sales s
INNER JOIN sales_date sd ON sd.id = s.date_id
INNER JOIN products p ON p.id = s.product_id
INNER JOIN stores st ON st.id = s.store_id
INNER JOIN campaigns cam ON cam.id = s.campaign_id
WHERE s.product_id = 35
GROUP BY sd.weekday, p.name

What happens behind the scenes

  1. The relational engine loads all of those rows (each consisting of over 100 attributes in production) from disk into memory
  2. Parses them
  3. Filters out those that don't meet the required conditions.

Note: this can take a long time in production, because we are querying petabytes of data.
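Here is a toy Python sketch of that row-by-row work; a real engine reads pages from disk rather than Python lists, so this only approximates the shape of the problem:

# Toy sketch of a row store answering the query: every full row is
# loaded and parsed, even though only two of its fields are needed.
def row_store_scan(rows, product_id):
    total = 0
    for row in rows:                              # 1. load every row
        record = dict(row)                        # 2. parse the whole record
        if record["product_id"] == product_id:    # 3. filter
            total += record["quantity"]
    return total

rows = [{"product_id": 35, "quantity": 2, "price": 999.0},
        {"product_id": 12, "quantity": 1, "price": 49.0}]
print(row_store_scan(rows, 35))   # -> 2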

Birth of analytics systems 

To solve the problems we face when doing analysis on transactional systems, analytics systems with columnar storage were born, built specifically for analyzing data. The same sales data we saw in the transactional system is stored in an analytics system as below.

Column storage layout

[Image: the sales data laid out column by column]
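In Python terms, a sketch of the same idea: each column lives in its own array (in a real system, its own file), and a query reads only the columns it needs. The data values are made up:

# Sketch: the same sales data stored column by column. A query that
# needs only product_id and quantity touches just those two arrays;
# price, store_id, etc. stay untouched on disk.
sales_columns = {
    "date_id":    [1, 1, 2],
    "product_id": [35, 12, 35],
    "store_id":   [3, 4, 3],
    "quantity":   [2, 1, 5],
}

def column_store_scan(cols, product_id):
    total = 0
    for pid, qty in zip(cols["product_id"], cols["quantity"]):
        if pid == product_id:
            total += qty
    return total

print(column_store_scan(sales_columns, 35))   # -> 7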

Now you may wonder how this helps us do analytics on petabytes of data.

Columnar compression 

Here is the magic: since the data stored in a column is often quite repetitive, columnar storage engines compress it, choosing among different techniques based on the data. One technique that is used particularly often is bitmap encoding.

Example: after columnar compression, the product_id column will be stored as below.

[Image: bitmap-encoded product_id column]

Bitmap indexes such as these are very well suited to the kinds of queries that are common in a data warehouse. Let's execute the same query we wrote earlier.

Load the bitmap for product_id = 35 and evaluate the filter with bitwise operations (OR across bitmaps when a query matches several values, AND when combining conditions on different columns), which can be done very efficiently.
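Here is a small Python sketch of bitmap encoding; the data is made up, but the technique is the one described above:

# Sketch of bitmap encoding: one bitmap per distinct product_id, with
# a 1 bit in every row position that holds that value.
product_id_column = [35, 12, 35, 68, 35, 12]

bitmaps = {}
for row_index, value in enumerate(product_id_column):
    bitmaps.setdefault(value, [0] * len(product_id_column))
    bitmaps[value][row_index] = 1

# bitmaps[35] == [1, 0, 1, 0, 1, 0]
rows_for_35 = bitmaps[35]                    # WHERE product_id = 35

# WHERE product_id IN (12, 68): bitwise OR of two bitmaps
rows_12_or_68 = [a | b for a, b in zip(bitmaps[12], bitmaps[68])]

# ANDing with another column's bitmap works the same way, bit by bit
print(rows_for_35, rows_12_or_68)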

Vectorized processing 

We know that in analytics systems queries need to scan over millions of records. Developers of analytical databases also worry about efficiently using the bandwidth from main memory into the CPU cache, avoiding branch mispredictions and bubbles in the CPU instruction processing pipeline, and making use of single-instruction-multiple-data (SIMD) instructions in modern CPUs.

What happens behind the scenes

  1. The query engine takes a chunk of compressed column data that fits comfortably in the CPU's L1 cache
  2. It iterates through the chunk in a tight loop

Note: since we are doing column compression, we can fit more data in the L1 cache and process it there.

  3. Operators, such as the bitwise AND and OR described previously, can be designed to operate on such chunks of compressed column data directly. This technique is known as vectorized processing, sketched below.
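A toy Python illustration of the idea: a real engine runs SIMD instructions over compressed data, while this sketch only shows the chunk-at-a-time structure, with a chunk size chosen as an assumption:

# Toy sketch of vectorized processing: operate on fixed-size chunks
# of a column in a tight loop, instead of row by row.
CHUNK = 4096   # assumed size so a chunk fits comfortably in the L1 cache

def count_where_equals(values, target):
    total = 0
    for start in range(0, len(values), CHUNK):
        chunk = values[start:start + CHUNK]   # one cache-friendly chunk
        # tight inner loop over the chunk; a real engine would apply
        # SIMD instructions to compressed data here
        total += sum(1 for v in chunk if v == target)
    return total

print(count_where_equals([35, 12, 35] * 10, 35))   # -> 20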

Sorting column storage data

  1. Sorting each column independently makes no sense, because we would no longer know that the kth item in one column belongs to the same row as the kth item in another column; rows must be sorted as a whole, even though the data is stored by column.
  2. The database administrator can configure which column the table is sorted by.

For example, if queries often target date ranges, such as the last month, it might make sense to make date_id the first sort key in the sales table.

  3. Data needs to be replicated to multiple machines anyway, so that you don't lose data if one machine fails. For several different sort orders, you might as well store that redundant data sorted in different ways, so that when you're processing a query, you can use the version that best fits the query pattern.

Example: you could even store the differently sorted copies on the same machine under different table names :D
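Here is a Python sketch of sorting rows as a whole while keeping the columnar layout, with date_id as the first sort key as in the example above; the data values are made up:

# Sketch: sort the *rows* by date_id while the data stays in columns.
# One permutation is computed from the sort-key column and applied to
# every column, so the kth items still belong to the same row.
cols = {
    "date_id":    [3, 1, 2],
    "product_id": [35, 12, 35],
    "quantity":   [5, 2, 1],
}

order = sorted(range(len(cols["date_id"])), key=cols["date_id"].__getitem__)
sorted_cols = {name: [col[i] for i in order] for name, col in cols.items()}

# sorted_cols["date_id"] == [1, 2, 3]: date-range queries can now stop
# scanning early, and runs of equal values compress well
print(sorted_cols)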

Writes in column storage

These optimizations make sense in data warehouses, because most of the load consists of large read-only queries run by analysts. Column-oriented storage, compression, and sorting all help to make those read queries faster. However, they have the downside of making writes more difficult. 
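A small Python sketch of why writes get harder: inserting one row into sorted columnar data touches every column, and on disk this would mean rewriting compressed column files. The data values are made up:

# Sketch: inserting a single row into sorted columnar data. The new
# values must be spliced into *every* column at the same position to
# keep rows aligned and the sort order intact.
import bisect

def insert_row(cols, sort_key, row):
    pos = bisect.bisect(cols[sort_key], row[sort_key])  # keep sort order
    for name, col in cols.items():
        col.insert(pos, row[name])   # O(n) shift in every column

cols = {"date_id": [1, 2, 3], "product_id": [12, 35, 35], "quantity": [2, 1, 5]}
insert_row(cols, "date_id", {"date_id": 2, "product_id": 68, "quantity": 4})
print(cols["date_id"])   # -> [1, 2, 2, 3]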

Materialized views 

Data warehouse queries often involve an aggregate function, such as COUNT, SUM, AVG, MIN, or MAX in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time; a cache of such precomputed results is called a materialized view.

In some data warehouses, relational databases are also used for analytics; there we can do this by creating virtual views, table-like objects whose contents are the results of some query. A virtual view won't be as efficient as column storage when the data grows, but it can be optimized with materialized data cubes holding pre-computed values.

Note: updates and writes are more expensive with such views, because the precomputed results must be kept up to date when the underlying data changes.
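Here is a toy Python sketch of a materialized view as a precomputed cache: reads become cheap, while writes pay the extra cost of keeping the cache fresh. The data and grouping are made up for illustration:

# Toy materialized view: SUM(quantity) grouped by (weekday, product_id)
# is computed once and cached; reads skip the raw data entirely.
from collections import defaultdict

sales = [("Monday", 35, 2), ("Monday", 35, 1), ("Tuesday", 12, 4)]

view = defaultdict(int)                       # the "materialized" aggregate
for weekday, product_id, quantity in sales:
    view[(weekday, product_id)] += quantity

print(view[("Monday", 35)])   # -> 3, no scan of the raw data needed

def insert_sale(weekday, product_id, quantity):
    sales.append((weekday, product_id, quantity))
    view[(weekday, product_id)] += quantity   # writes must also refresh the view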

Hope you have learned something new in this article. Please like, share, and comment; it may help others too.

Happy learning!
