Transaction Processing or Analytics?
Transaction processing systems
In the early days of business data processing, a write to the database typically corresponded to a commercial transaction taking place: making a sale, placing an order with a supplier, paying an employee’s salary, etc.
An application typically looks up a small number of records by some key, using an index. Records are inserted or updated based on the user's input. Because these applications are interactive, the access pattern became known as online transaction processing (OLTP).
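As a minimal sketch of that access pattern (using Python's built-in sqlite3 module and a hypothetical users table, not the shopping application's real schema):

import sqlite3

# Minimal OLTP sketch: a hypothetical `users` table with a primary-key index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")
conn.execute("INSERT INTO users VALUES (42, 'Alice', 100.0)")

# Look up a small number of records by key, using the index.
row = conn.execute("SELECT name, balance FROM users WHERE id = ?", (42,)).fetchone()

# Insert or update a record based on the user's input (e.g. paying a salary).
conn.execute("UPDATE users SET balance = balance + ? WHERE id = ?", (500.0, 42))
conn.commit()
print(row)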
Analytics systems
Analytics systems have a different access pattern than transaction processing systems. Usually an analytic query needs to scan over a huge number of records, reading only a few columns per record, and calculate aggregate statistics (such as count, sum, or average) rather than returning the raw data to the user.
These queries are often written by business analysts, and feed into reports that help the management of a company make better decisions (business intelligence).
Comparing characteristics
For example, consider a shopping application.
To make the concepts clear, we will walk through the sales module of the application's relational model.
The price and discount price fields are attributes of the sale itself; the other fields are foreign key references to other tables, called dimension tables. Each row in the sales table represents an event, and the dimensions answer the what, how, and when of that sale.
When the table relationships are visualized, the sales table sits in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star, hence the name “star schema”. When the dimension tables are broken down further into subdimensions, such as a product images table or a store images table, the variation is called a snowflake schema.
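As a rough sketch of the idea (the table names, keys, and values below are invented for illustration, not the application's actual schema), the fact table and its dimension tables could look like this:

# Rough star-schema sketch as plain Python data (names and values are invented).
# Dimension tables describe the what / where / when of each sale.
products   = {35: "iPhone 15", 68: "MacBook Air"}
stores     = {7: "Downtown", 9: "Airport"}
sales_date = {101: ("2024-05-03", "Friday")}
campaigns  = {3: "Summer Sale"}

# Fact table: one row per sale event. price and discount_price are attributes of
# the sale itself; the *_id fields are foreign keys into the dimension tables.
sales = [
    {"date_id": 101, "product_id": 35, "store_id": 7, "campaign_id": 3,
     "quantity": 2, "price": 999.0, "discount_price": 899.0},
]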
The sales table captures each sale (event) for later analysis. However, this means the sales table grows extremely large; large online stores like Apple, Walmart, and Flipkart generate petabytes of such data.
Let's start our analysis: suppose we want to find out whether customers are more inclined to buy Apple products depending on the day of the week.
MySQL query used
SELECT
    sd.weekday,
    p.name AS product_name,
    SUM(s.quantity) AS quantity_sold,
    GROUP_CONCAT(st.name) AS stores,
    GROUP_CONCAT(cam.name) AS campaigns
FROM sales s
INNER JOIN sales_date sd ON sd.id = s.date_id
INNER JOIN products p ON p.id = s.product_id
INNER JOIN stores st ON st.id = s.store_id
INNER JOIN campaigns cam ON cam.id = s.campaign_id
WHERE s.product_id = 35
GROUP BY sd.weekday, p.name;
Things happening behind the scenes
- Relational systems will load all of those rows (each consisting of over 100 attributes in production) from disk into memory
- Parse them
- Filter out those that don’t meet the required conditions.
Note: this can take a long time, because in production we are querying petabytes of data (see the toy sketch below).
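Here is a toy sketch of what that row-oriented scan looks like: every row, with all of its attributes, has to be loaded and parsed before the filter can even be applied (the attribute names and data are made up):

# Toy sketch of a row-oriented scan (attribute names and data are made up).
def make_row(i):
    row = {"product_id": 35 if i % 7 == 0 else i % 100, "quantity": 1}
    row.update({f"attr_{k}": k for k in range(100)})   # ~100 other attributes
    return row

rows = (make_row(i) for i in range(100_000))   # stand-in for reading rows from disk

total_quantity = 0
for row in rows:                      # every full row is loaded and parsed...
    if row["product_id"] == 35:       # ...just to apply this single filter
        total_quantity += row["quantity"]
print(total_quantity)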
Birth of analytics systems
To solve the problems we face when doing analysis on transactional systems, analytics systems with columnar storage were built specifically for analysing data. The same sales data we saw in the transactional system is stored in an analytics system as shown below.
Column storage layout
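A minimal sketch of the layout (the values are invented): each column is stored in its own array or file, and the kth entries of all the columns together make up the kth row:

# Column-oriented layout: one array per column (the values are invented).
# The kth entry of every array belongs to the same sale (row k).
columns = {
    "date_id":     [101, 101, 102, 102, 103],
    "product_id":  [35,  68,  35,  35,  69],
    "store_id":    [7,   7,   9,   7,   9],
    "campaign_id": [3,   3,   3,   1,   1],
    "quantity":    [2,   1,   1,   3,   1],
}

# A query that only needs product_id and quantity touches just those two arrays,
# instead of loading every attribute of every row.
sold = sum(q for p, q in zip(columns["product_id"], columns["quantity"]) if p == 35)
print(sold)   # -> 6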
Now, you may wonder how this helps us do analytics on petabytes of data.
Columnar compression
Here is the magic: since the data stored in a column is often quite repetitive, columnar storage engines compress it, using different techniques depending on the data. One technique that is particularly effective is bitmap encoding.
Example: after bitmap encoding, the product_id column is stored as shown below.
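Here is a small sketch of bitmap encoding (the column values are invented): for each distinct product_id we keep one bitmap, with one bit per row, set to 1 wherever that row contains this product_id:

# Bitmap encoding sketch: one bitmap per distinct value (the values are invented).
product_id_column = [35, 68, 35, 35, 69, 68, 35]

bitmaps = {}
for row_number, value in enumerate(product_id_column):
    # Bit i of the bitmap for `value` is 1 if row i contains that product_id.
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_number)

for value, bits in sorted(bitmaps.items()):
    # Print bit 0 (row 0) first, so each line reads left to right as rows 0..6.
    print(value, format(bits, "07b")[::-1])
# 35 1011001   (rows 0, 2, 3 and 6)
# 68 0100010   (rows 1 and 5)
# 69 0000100   (row 4)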
Bitmap indexes such as these are very well suited to the kinds of queries that are common in a data warehouse. Let's execute the same query we wrote earlier.
Load the bitmap for product_id = 35 and combine it with the bitmaps for any other predicates using bitwise AND/OR, which can be done very efficiently.
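A tiny sketch of that step, continuing the invented bitmaps above (the Friday bitmap is hypothetical, just to show a second predicate):

# Sketch: combining predicate bitmaps with bitwise operators (values invented).
# Bit i of each bitmap is 1 if row i satisfies that predicate.
product_35_rows = 0b1001101   # rows 0, 2, 3 and 6 have product_id == 35
friday_rows     = 0b0001011   # rows 0, 1 and 3 were sold on a Friday (hypothetical)

matching = product_35_rows & friday_rows              # bitwise AND: both predicates
print([row for row in range(7) if (matching >> row) & 1])   # -> [0, 3]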
Vectorized processing
We know that queries in analytics systems need to scan over millions of records. Developers of analytical databases also worry about efficiently using the bandwidth from main memory into the CPU cache, avoiding branch mispredictions and bubbles in the CPU instruction processing pipeline, and making use of single-instruction-multiple-data (SIMD) instructions in modern CPUs.
Things happening behind the scenes
- The query engine can take a chunk of compressed column data that fits comfortably in the CPU’s L1 cache
- Iterate through it in a tight loop
Note: since the columns are compressed, more of the data fits in the L1 cache, so more records can be processed per pass.
- Operators, such as the bitwise AND and OR described previously, can be designed to operate on such chunks of compressed column data directly. This technique is known as vectorized processing (a rough sketch follows below).
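A rough sketch of the idea using NumPy, whose array operations are themselves tight loops over contiguous memory that compilers can auto-vectorize with SIMD (the data and chunk size are invented; real engines operate on compressed chunks directly):

import numpy as np

# Rough sketch of vectorized, chunk-at-a-time processing (the data is invented).
product_id = np.random.choice([35, 68, 69], size=10_000_000).astype(np.int32)
quantity   = np.random.randint(1, 5, size=10_000_000).astype(np.int32)

CHUNK = 16_384            # small enough to sit comfortably in the CPU cache
total = 0
for start in range(0, len(product_id), CHUNK):
    p = product_id[start:start + CHUNK]
    q = quantity[start:start + CHUNK]
    # One vectorized comparison and one vectorized sum per chunk, instead of an
    # interpreted per-row loop with a branch for every single record.
    total += int(q[p == 35].sum())
print(total)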
Sorting column storage data
- Sorting each column independently makes no sense, because we know that the kth item in one column belongs to the same row as the kth item in another column; the rows have to be sorted as a whole.
- The database administrator can configure which column(s) the data is sorted by.
For example, if queries often target date ranges, such as the last month, it might make sense to make date_id the first sort key in the sales table.
- Data needs to be replicated to multiple machines anyway, so that you don't lose data if one machine fails. You might as well store that redundant data sorted in several different ways, so that when you're processing a query you can use the version that best fits the query pattern.
Example: you can even keep the differently sorted copies on the same machine under different names :D
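A small sketch of sorting column data by a chosen sort key (here date_id first, then product_id; the data is invented). All the columns have to be reordered together:

# Sketch: sorting column data by a chosen sort key (the values are invented).
# All columns must be reordered together, so row k stays row k in every column.
columns = {
    "date_id":    [103, 101, 102, 101],
    "product_id": [69,  35,  35,  68],
    "quantity":   [1,   2,   1,   1],
}

# Sort by (date_id, product_id): compute the row order once, apply it everywhere.
order = sorted(range(4), key=lambda k: (columns["date_id"][k], columns["product_id"][k]))
sorted_columns = {name: [values[k] for k in order] for name, values in columns.items()}

print(sorted_columns["date_id"])   # -> [101, 101, 102, 103]
# A second replica could keep the same data sorted by product_id instead, and the
# query planner would pick whichever sort order best matches the query.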
Writes in column storage
These optimizations make sense in data warehouses, because most of the load consists of large read-only queries run by analysts. Column-oriented storage, compression, and sorting all help to make those read queries faster. However, they have the downside of making writes more difficult: an update-in-place is not possible with compressed columns, and inserting a row in the middle of a sorted table would most likely mean rewriting all of the column files.
Materialized views
Data warehouse queries often involve an aggregate function, such as COUNT, SUM, AVG, MIN, or MAX in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time; caching such frequently used query results is exactly what a materialized view is.
In some data warehouses, relational databases are also used for analytics. There we can create virtual views: table-like objects whose contents are the results of some query. A virtual view is only a shortcut for writing the query, so it won't be as efficient as column storage once the data grows, and it can be optimized further with materialized data cubes that hold pre-computed aggregate values.
Note: updates and writes become more expensive in this model, because the materialized views and data cubes also have to be updated whenever the underlying data changes.
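A toy sketch of the idea behind a materialized view or data cube: pre-compute an aggregate once, keep it next to the raw data, and pay the extra cost of updating it on every write (the names and keys below are made up):

from collections import defaultdict

# Toy sketch of a materialized aggregate, a tiny "data cube" (names are made up).
# Cell (weekday, product_id) holds the pre-computed SUM(quantity).
sales_rows = []                                    # the raw fact data
quantity_by_weekday_product = defaultdict(int)     # the materialized aggregate

def insert_sale(weekday, product_id, quantity):
    # Every write now has to update the view as well, which is why writes become
    # more expensive once materialized views or data cubes exist.
    sales_rows.append((weekday, product_id, quantity))
    quantity_by_weekday_product[(weekday, product_id)] += quantity

insert_sale("Friday", 35, 2)
insert_sale("Friday", 35, 1)
insert_sale("Monday", 68, 1)

# Reads that would otherwise scan all raw rows become a single lookup.
print(quantity_by_weekday_product[("Friday", 35)])   # -> 3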
Hope you have learnt something new in this article. Please like, share, and comment so it can help others too.
Happy learning!