Scaling a PostgreSQL Database
Database scalability is one of the most crucial aspects of software design and development, and also of operations, where it means properly designing and configuring a scalable infrastructure to run the database engine.
In this article we will cover both aspects briefly, but with enough detail to be useful in practice.
Database Software Scalability
Broadly, there are three techniques for improving the scalability of a PostgreSQL database at the software development level: indexing, partitioning, and caching.
These three concepts apply not only to PostgreSQL but also to other relational database management systems (RDBMS) such as Oracle, MS-SQL, MySQL, and MariaDB.
Table Indexing
Indexing is the technique of building auxiliary lookup structures that hold the columns used as search criteria, that is, the columns that appear in the WHERE clause of a SELECT statement in an SQL-based database.
For example, suppose you have a users table with the following entity structure:
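As a minimal illustrative sketch, such a table might be defined as follows; the Id, UserName, and UserStatus columns come from the discussion below, while the Email column and the exact type choices are assumptions made for the example:

    -- Illustrative users table: Id is the primary key,
    -- UserName and UserStatus are index keys.
    CREATE TABLE users (
        Id         BIGSERIAL PRIMARY KEY,   -- the primary key automatically gets a unique index
        UserName   VARCHAR(100) NOT NULL,
        Email      VARCHAR(255),
        UserStatus SMALLINT NOT NULL        -- e.g. 1 = Active, 0 = Inactive
    );

    -- Secondary indexes on the columns used in WHERE clauses.
    CREATE INDEX idx_users_username   ON users (UserName);
    CREATE INDEX idx_users_userstatus ON users (UserStatus);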
If you look at the table, you will see that some columns are designated as either a primary key (Id) or an index key (UserName and UserStatus).
When you create this table, the RDBMS also builds three additional index structures, one for each of these keys.
Assigning an index to a column that is never used in a WHERE clause when selecting from that table is a waste of storage. On the other hand, if you leave a frequently filtered column unindexed, the performance of the database will degrade over time as the table grows to thousands and then millions of records.
In some cases, you frequently filter on multiple columns in the same WHERE clause, for example selecting user records whose usernames start with, say, "A" and whose status is Active (1). In this case the RDBMS needs to load both single-column indexes to search for the requested set of records. Adding a composite index that holds both the username and user status columns should improve the performance of such queries.
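A sketch of such a composite index on the illustrative users table above, together with the query it serves (the index name and the pattern operator class are choices made for this example):

    -- Composite index covering both columns used in the WHERE clause.
    -- varchar_pattern_ops lets the index serve prefix searches (LIKE 'A%')
    -- regardless of the database collation.
    CREATE INDEX idx_users_username_status
        ON users (UserName varchar_pattern_ops, UserStatus);

    -- Finds active users whose names start with "A" using the composite index.
    SELECT Id, UserName
    FROM users
    WHERE UserName LIKE 'A%'
      AND UserStatus = 1;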
You should be careful when adding extra indexes, however, as each one consumes additional storage and has to be maintained on every insert and update.
Partitioning
One of the most widely used techniques for huge databases, where records are counted in the millions, is partitioning.
Partitioning is an RDBMS feature that splits a main table into multiple smaller tables accessed through some partitioning criterion such as a date range, a year, or a calculated hash.
This allows a program to provide the criterion in its SELECT statement so that only the relevant partition is accessed.
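A minimal sketch of range partitioning in PostgreSQL (version 10 or later), using a hypothetical archived_orders table partitioned by year; the table and column names are assumptions for illustration:

    -- Parent table partitioned by a date column.
    CREATE TABLE archived_orders (
        id          BIGINT        NOT NULL,
        customer_id BIGINT        NOT NULL,
        created_at  DATE          NOT NULL,
        total       NUMERIC(12,2)
    ) PARTITION BY RANGE (created_at);

    -- One partition per year of archived data.
    CREATE TABLE archived_orders_2023 PARTITION OF archived_orders
        FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
    CREATE TABLE archived_orders_2024 PARTITION OF archived_orders
        FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

    -- Because created_at appears in the WHERE clause, the planner prunes
    -- every partition except archived_orders_2024.
    SELECT id, total
    FROM archived_orders
    WHERE created_at >= DATE '2024-01-01'
      AND created_at <  DATE '2024-07-01';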
Only use partitioning for huge databases where you can predict enormous data growth. For example, frequently accessed online records are usually kept in a separate database covering, say, the most recent 12 months, while archived data spanning years is usually designed around partitions.
Even in such a case, if the data is not expected to grow that large, there is no need to add the overhead of partitioning; adopt it only when you know for a fact that your data will grow huge.
Caching
RDBMS engines usually access data stored in tablespaces on disk, so the speed of retrieving data depends heavily on disk speed. For example, NVMe SSDs are much faster than SATA SSDs, which in turn are faster than old IDE drives.
Yet none of these disk types can match random access memory (RAM). Engines such as Redis keep an in-memory cache that stores the result sets of frequently executed SELECT statements.
Adding an in-memory cache such as Redis to your software design and operational environment can tremendously boost your RDBMS performance.
Database Hardware Infrastructure Scalability
As mentioned above, one of the main hardware factors affecting RDBMS performance is the type of storage used, usually measured in read/write input/output operations per second (IOPS). While disk storage access times are still measured in microseconds to milliseconds, RAM operates at nanosecond latencies. RAM, however, is not permanent storage: it loses all of its data if the power supply goes offline.
Unless you can build RAM-based storage that is guaranteed to stay powered at all times, you will have to keep relying on SSD disk storage.
Vertical Scalability
RDBMS engines are known to be vertically scalable: you add more CPU power and more RAM to a single server to get better performance. Scaling out horizontally by adding more server nodes is not supported by all RDBMS engines; PostgreSQL in particular has no built-in multi-master write scaling, although read replicas can be used to spread read load.
Locally Attached vs Shared Storage
The best performance is achieved using locally attached, high-performance disk drives with the lowest access times; today that means NVMe drives.
Providing a cache at the disk controller level adds another layer of performance.
Using shared storage such as network-attached storage (NAS) or a storage area network (SAN) introduces another factor affecting performance: network latency and network speed.
Although in some RDBMS engines this type of storage is used to provide horizontal scalability, allowing multiple database servers to serve application requests at the same time, not all engines support this concept, and it adds the networking overhead.
Cache Configuration
If a cache server such as Redis is used, it should be configured on its own server or cluster with an adequate amount of RAM to hold the result sets of the most frequently used SELECT statements.