A real life comparison

Hi LinkedIn readers. This is (after a huge delay…) part 2 of the “Mongo VS Postgres - Real life comparison” article, and it will cover the following subjects:

  • Extensions
  • Indexing
  • Backup and recovery
  • Practical max size per node
  • Sharding
  • Monitoring
  • Connection pooling
  • Proxies
  • Development and administration tools

If you haven't had the chance to read the first part yet, I highly suggest doing so before heading on with this one, using this link: https://www.dhirubhai.net/pulse/mongo-vs-postgres-robert-yackobian/?trackingId=shzA5XN2QVqU0hkUCE0vbg%3D%3D

So without further ado let's start with the first subject.

Extensions

Extensions are functionality provided on top of the core database functionality. This is the main option for the community to add and change database functionality without being code contributors.

Postgres is easily extensible. There are hundreds (possibly thousands) of extensions available for every aspect of the database, and they are a great option for extending the database's functionality. Using a Postgres extension couldn't be easier: install the Postgres contrib package once (on most systems it is already installed) and run a single command to add the extension. For example, to add encryption functionality, add the pgcrypto extension: psql -d your_database -c "create extension pgcrypto"
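To make this concrete, here is a minimal sketch, assuming a database named your_database and a standard contrib installation:

```shell
# Add the pgcrypto extension (one-time, per database)
psql -d your_database -c "CREATE EXTENSION IF NOT EXISTS pgcrypto;"

# The new functions are available immediately, e.g. hashing a password
psql -d your_database -c "SELECT crypt('secret', gen_salt('bf'));"
```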

I mentioned the contrib package. It’s a Linux package that includes the most used, valued, and stable Postgres extensions, as listed here: https://www.postgresql.org/docs/current/contrib.html . There are countless others that were not included in the contrib package.

For example, I found this cool extension for handling JSON in Postgres, which is not included in the contrib package: https://github.com/postgrespro/jsquery/blob/master/README.md .

In fact, Postgres extensions are so capable that you can build an entire database system on top of Postgres - for example, the Citus database (https://www.citusdata.com/ ). Citus is a sharding solution for Postgres, and it adds the sharding functionality using a single extension on top of the Postgres core binaries.
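As a sketch of how little is involved (the orders table and its distribution column here are hypothetical, and this assumes citus is already in shared_preload_libraries):

```shell
# Citus ships as a regular extension on top of the Postgres binaries
psql -d your_database -c "CREATE EXTENSION citus;"

# Shard a (hypothetical) orders table across the worker nodes
psql -d your_database -c "SELECT create_distributed_table('orders', 'customer_id');"
```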

I found no third-party extensions for Mongo, so it is limited to the functionality delivered by MongoDB Inc.

How about Percona’s Mongo (https://www.percona.com/Mongo/software/Mongo-distribution ), you may ask?

Please don’t confuse it for an extension. It is a separate distribution (a fork). Postgres has plenty of different forks - Greenplum and Vertica, to name a few.

Conclusion: Postgres is the winner if you need your database to be able to “acquire” new functionality over time.

Indexing

What good is a database if data cannot be accessed easily?

Mongo delivers great index types that, when utilized properly (assuming a good database design), will get you the data quickly in most scenarios.

Mongo’s indexes: Single Field, Compound, Multikey, Geospatial, Text, Unique, Partial, and Sparse indexes. TTL indexes are not mentioned, as they serve other functionality rather than fast access to data.

The Postgres built-in list is shorter but still delivers roughly the same functionality:

B-tree, Hash, GiST, SP-GiST, GIN, BRIN

Consider the following JSON document:

{"toy_name": "dog", "can_do": ["eat", "sleep", "bark"]}

Assuming the toys collection contains millions of toy documents and you are looking for those that can bark, use this to find them in Mongo:

db.toys.find({"can_do": "bark"})

A suitable Mongo index for this scenario would be:

db.toys.createIndex({"can_do": 1})

This will index the content of “can_do”. If “can_do” contains an array, as in "can_do": ["eat","sleep","bark"], it will be indexed properly.

A similar Postgres index would be:

create index toys_can_do on toys USING GIN ((json_column_name->'can_do') jsonb_path_ops);

As in Mongo, the Postgres index will index everything inside the field, be it an array or a JSON document.

So both databases have good indexing options that do the job.

But Postgres goes further:

  • Trigram indexes - https://medium.com/swlh/performance-optimisation-for-wildcards-search-in-postgres-trigram-index-80df0b1f49c7
  • Hypothetical indexes - a PostgreSQL extension to analyze queries and give indexing advice without actually creating the index: https://github.com/HypoPG/hypopg
  • Postgres possesses a better, more sophisticated query planner that uses more extensive statistics about the data and utilizes indexes better.
  • Postgres has great index analyzers and advisors - for example, pg_idx_advisor.

Personally, I often use the Postgres trigram index extension to index queries such as: select * from users where name like '%abc%'
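For readers who want to try it, here is a minimal sketch (the users table and its name column are hypothetical):

```shell
psql -d your_database <<'SQL'
-- Enable the trigram extension (ships with contrib)
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- A GIN trigram index lets LIKE '%abc%' use the index instead of a seq scan
CREATE INDEX users_name_trgm ON users USING GIN (name gin_trgm_ops);
EXPLAIN SELECT * FROM users WHERE name LIKE '%abc%';
SQL
```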

So Postgres already delivers about the same as Mongo out of the box, and this can be easily extended further using extensions.

Backup and recovery

Just to be clear: dump and restore (in any database) is not a backup strategy. It is slow both ways (slow dump, even slower restore) and cannot restore to a point in time. So tools such as Mongo’s mongodump/mongorestore and Postgres’s pg_dump/pg_restore are not discussed here.

Now back to backup and recovery. Both databases deliver tools/APIs for block-level backup and recovery. Block level basically means a data-file backup.

As in most databases, Mongo and Postgres must be placed into some “backup mode” while doing a backup. The backup mode syncs dirty data to the disk and brings all data files to the same consistent state. While the database is in backup mode, all the data files can be backed up as a simple file-level backup.

The main difference is that Mongo must lock the database against changes, while the Postgres backup mode allows updating the database while the backup is running.
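The difference is visible in the commands themselves. A sketch (labels and paths are illustrative; pg_backup_start/pg_backup_stop are the Postgres 15+ names, older versions use pg_start_backup/pg_stop_backup):

```shell
# Postgres: the database stays fully writable between these two calls
psql -d your_database -c "SELECT pg_backup_start('nightly');"
# ... copy the data directory with cp/rsync/snapshot ...
psql -d your_database -c "SELECT pg_backup_stop();"

# Mongo community: fsyncLock blocks writes for the duration of the copy
mongosh --eval 'db.fsyncLock()'
# ... copy the dbPath ...
mongosh --eval 'db.fsyncUnlock()'
```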

Of course, in both databases you can simply shut down the instance and run a full backup of the database (also referred to as a cold backup), but this is not desired in production, as the database is down during the cold backup. This (Mongo locking the database during the backup) is the first main difference between the two databases and is a big win for Postgres.

Another key difference is that Postgres delivers incremental backups (backups of the WALs), while Mongo’s community version does not. This means that Postgres allows you to restore to a point in time while Mongo does not.
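As a sketch of what point-in-time recovery requires on the Postgres side (all paths are illustrative):

```shell
# postgresql.conf - ship every completed WAL segment to a safe location:
#   archive_mode = on
#   archive_command = 'cp %p /backup/wal/%f'

# One periodic base backup; the archived WALs fill in everything since
pg_basebackup -D /backup/base -Ft -z -X fetch
```

Restoring is then a matter of unpacking the base backup and replaying archived WALs up to the desired recovery target time.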

You can acquire the expensive Mongo Enterprise edition, or an even more expensive external backup system, in order to do continuous incremental backups, but these tools are costly and in most cases hard to set up. Besides, we are comparing apples to apples here: Mongo’s community version VS standard Postgres.

What about backup of a postgres or a Mongo shard cluster?

Postgres delivers the same methods used for regular database backups in shard clusters (such as the Citus database) as well. That means you can do a full, consistent backup of all nodes of the shard cluster from one central location. With Mongo, on the other hand, you will need to back up each replica set (which acts as a shard) separately and somehow coordinate them to the same point in time in order to obtain a consistent backup of the entire shard cluster.

Postgres is also supported by many backup tools while Mongo's list is much smaller.

A note:

Mongo community does allow backing up one of the secondaries (and later restoring the primary from it). Postgres does not: with Postgres you cannot restore the primary from an incremental (WAL) backup of a standby instance.

Despite the last note, I believe Postgres delivers better backup and restore tools and options.

Practical max instance size per node

This is a question that is often asked: how big can a single Mongo or Postgres node get without losing performance or other features?

The first limitation would be the max volume size delivered by the volume manager. This depends on the type of file system you're using: ext4, for example, has a maximum filesystem size of 1 EB and a maximum file size of 16 TB. 1 exabyte (1,000 petabytes, or 1 million terabytes) is quite large, so obviously the file system is not the limit of the practical max size that you can use in either Postgres or Mongo.

So the issue is not with the disk space allocation, but with allocating more CPU and memory resources to a single machine rather than to different ones.

In most Postgres instances I have worked with, I prefer not to exceed an instance size of 5 TB. In Mongo databases I would set the limit even lower, not exceeding one terabyte. This is mainly because Mongo’s single “mongod” process does the entire work of the database, while in Postgres different core database processes split the workload between them. Postgres also allocates a dedicated process to every new session; in Mongo, all client requests are handled by the mongod process.

In both databases, for larger data sets I would prefer to use a shard cluster. This way I can extend the database size by volume as well, and effectively and cheaply allocate more CPU, RAM and other resources.

As for storage, Postgres is more effective with storage management thanks to its partitioned tables, which allow you to easily get rid of data that is no longer required. Mongo does deliver TTL and capped collections as a partial replacement for a partitioned table: with TTL, Mongo allows you to keep the last X days of data, and with capped collections it allows growth of the collection up to a predetermined max size. Both options are implemented by background, system-invoked “delete” statements, which are not as efficient as a “drop partition” statement in Postgres.
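The difference in mechanics can be sketched like this (the events table, partition name, and TTL period are hypothetical):

```shell
# Postgres: removing a month of old data is a near-instant metadata operation
psql -d your_database -c "ALTER TABLE events DETACH PARTITION events_2023_01;"
psql -d your_database -c "DROP TABLE events_2023_01;"

# Mongo: a TTL index deletes expired documents one by one in the background
mongosh --eval 'db.events.createIndex({created_at: 1}, {expireAfterSeconds: 2592000})'
```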

The Postgres file layout is also better suited for large volumes of data. Postgres, for example, allows storing large values outside the table segment, while in Mongo everything is stored within the same collection. And lastly, Postgres allows using tablespaces built on different disks, enabling growth beyond the size of a single volume and, more importantly, spreading the IO operations between different disks for better IO scalability.

In my view, Postgres is the winner here.

Sharding

Sharding is the process of partitioning the data of a single table, or an entire database, across different nodes. This is unlike regular table partitioning, which splits a large table into smaller partitions that all live on the same node. Sharding achieves (in most cases) good horizontal scalability on cheap machines. Postgres sharding solutions include Postgres-XL, Greenplum and Citus, with the latter being the most popular solution today.

Many DBAs deploy a simple (but effective) poor man’s sharding using the Postgres FDW (foreign data wrappers, essentially database links). All the sharding options above deliver the same full ANSI SQL and ACID capabilities as a standalone node.
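A minimal sketch of the FDW approach (host name, credentials, and the orders table are all hypothetical):

```shell
psql -d your_database <<'SQL'
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
-- Point at the remote node that holds this shard
CREATE SERVER shard2 FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'shard2.internal', dbname 'your_database');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard2
  OPTIONS (user 'app', password 'secret');
-- The remote table is now queryable as if it were local
CREATE FOREIGN TABLE orders_shard2 (id bigint, total numeric)
  SERVER shard2 OPTIONS (table_name 'orders');
SQL
```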

Overall management of the shard cluster in postgres is very much the same as the standalone version.

Monitoring, backup and restore, space management and performance are all derived from those of the standalone core Postgres nodes, with exceptions for managing the shard-cluster aspects.

You can set up a Postgres Citus cluster on as little as 2 nodes, and with 2 additional standby nodes you have an HA, production-ready setup - a total of 4 nodes. Mongo’s sharding is its own built-in shard cluster, not an external fork or extension. It consists of Mongo’s replica sets (which deliver the high availability); each replica set is called a shard. Joining 2 or more replica sets (or simply shards) creates a shard cluster. In Mongo, same as in Postgres, there is an extra level of management required by the extra features of the shard cluster, but overall, everything else is managed the same as a single replica set.

Mongo’s shard cluster is, in my opinion, fairly easy to set up, but it requires more nodes compared to a Postgres shard cluster.

To set up a minimal Mongo shard cluster (for production) you would need:

  • 2 replica sets, each containing 3 data nodes - that is 6 nodes in total
  • A config server, which is itself a replica set (but stores only metadata - no actual user data) - that is 3 additional nodes
  • 1 mongos router

This is a total of 10 nodes, compared to 4 nodes for a healthy Postgres shard cluster.

Managing the Mongo shard cluster is anything but simple. There is no unified backup and restore solution, for example. Monitoring also becomes a challenge, as a practical Mongo shard deployment will typically include many shards, each containing at least 3 nodes; you end up with dozens of Mongo nodes in a typical shard cluster, compared to a few in a Postgres shard cluster. And then there is the dreaded Mongo data balancer, which spreads the data evenly between the shards. It is not great, and you will spend a lot of time fiddling with the collection’s partitioning key, hoping to help the balancer spread the data effectively between the different shards.

I believe the Postgres shard cluster is easier to manage, while yielding the same performance and high availability.


Monitoring

Both databases deliver all the CLI, API, metadata tables and collections needed to properly monitor the database. With Postgres there are countless monitoring solutions, free and commercial, built on top of the Postgres metadata, that deliver excellent monitoring.

The third-party monitoring tools list for Mongo is much shorter.

Mongo does deliver everything you need for monitoring (https://www.Mongo.com/basics/how-to-monitor-Mongo-and-what-metrics-to-monitor ), but it’s going to be harder to set up because there are fewer options and tools for Mongo.
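As a taste of what the built-in metadata gives you in each database (a sketch, assuming default local installations):

```shell
# Postgres: active sessions and their queries, from pg_stat_activity
psql -d your_database -c "SELECT pid, state, query FROM pg_stat_activity WHERE state <> 'idle';"

# Mongo: connection counters from the built-in serverStatus command
mongosh --quiet --eval 'printjson(db.serverStatus().connections)'
```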

Connection pooling

As a developer, you may not have spent a lot of time thinking about database connections, also referred to as sessions or backends. A single database connection is not expensive in terms of machine resources, but as things scale up, problems can emerge. This is because each connection consumes machine resources such as memory, CPU and TCP sockets, and machine resources are not infinite.

A database connection can be either ACTIVE (currently doing some work) or INACTIVE/IDLE (connected to the database but not running anything at the moment). The ideal setup would be to have mostly active connections, with as few as possible IDLE connections consuming resources (mostly memory) while doing nothing. This is what a connection pool does. A connection pooling solution can help us build more performant applications, especially when operating at scale. A connection pool can be a server-side tool or delivered by the driver code. For example, you can achieve Postgres connection pooling by installing pgBouncer (a server-side solution) or by using Python's psycopg (driver) API.

Server-side connection pooling is preferred, as it is ready to install and does not require coding. It is also preferred because it supports your entire application stack, no matter what programming language you use. And finally, these tools are more robust and mature, as they are widely used.

Postgres delivers several server-side connection pooling solutions, with pgBouncer being the most popular, but also pgpool and other tools from cloud providers - AWS RDS Proxy, for example.
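A minimal pgBouncer setup sketch (all values are illustrative, not tuning advice; auth_type = trust is for a local sketch only):

```shell
cat > /etc/pgbouncer/pgbouncer.ini <<'INI'
[databases]
your_database = host=127.0.0.1 port=5432 dbname=your_database

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = trust
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
INI

# Run as a daemon; the application now connects to port 6432 instead of 5432
pgbouncer -d /etc/pgbouncer/pgbouncer.ini
```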

Mongo does not deliver a ready-to-install tool for connection pooling. In Mongo, connection pooling is delivered by the driver, such as the MongoDB driver used by Spring Boot. This means you often have to code and set up the connection pool as part of your application. I personally prefer a ready-to-install tool that is not part of the development stack.


Proxies

A proxy is most often used in Postgres to redirect the user/application connection request to the primary. As the cluster constantly monitors the heartbeat of all its members, switchovers or failovers can happen from time to time, changing the role of cluster members and designating a new primary/leader. Postgres relies on external tools such as HAProxy and GoBetween for that; Mongo delivers it through the URI connection parameters.

This means that for postgres you need to install and manage an external tool while for Mongo you have it as a built in feature.

HAProxy and GoBetween are extremely configurable but require more expertise. With GoBetween, for example, you can route connections to the least loaded node, but you will have to do some coding for that. You can do about the same (much more easily) using Mongo’s read and write concerns - simple tag-based rules that dictate the node to which a request is routed. I prefer the Mongo solution because it is easier to set up, so it is possibly more robust, and it is also supported by all major development stacks.
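For comparison, this is roughly all it takes on the Mongo side (node names and the replica-set name are hypothetical):

```shell
# The driver discovers the topology and routes reads to a secondary when possible
mongosh "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0&readPreference=secondaryPreferred"
```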

Development & administration tools

This includes built-in support and tools, as well as third-party tools and development scripting/languages. There are many great tools for Postgres, as listed in this partial list: https://www.postgresql.org/download/products/1-administrationdevelopment-tools/ . Tools such as pgAdmin, DBeaver and DataGrip enable you to interact with and administer the database. On the other hand, I know of fewer comparable tools for Mongo. Actually, when writing this article I was surprised to see that the list of tools for Mongo has grown nicely, much more than I remembered, but still, in my opinion the ecosystem for Postgres is by far bigger and better than Mongo’s.


That’s all folks. I hope you enjoyed this extra long article.

My next article will be a benchmark comparing how cost-effective a cloud database (such as an AWS RDS instance) is compared to an on-premises Postgres instance.

I expect to be surprised.
