Building Resilient Applications: Leveraging Modern Databases for High Availability
This article builds upon my previous works ([1], [2], [3]) by showcasing a recent demo of my 'Resilient Simple Application' concept.
Recently, I collaborated with a couple of development teams to address their recurring issues with application availability and transaction loss. These were primarily caused by:
1. Application Architecture Deficiencies:
   a. Lack of High Availability (HA) planning and implementation during deployments
   b. Improper utilisation of database features for HA and Disaster Recovery (DR)
2. Developers’ Knowledge Gaps:
   a. Misunderstanding of database capabilities for achieving HA
   b. Limited awareness of recommended practices for database deployment and management
3. A Reactive Approach to End-User Issues:
   a. Focus on reactive fixes to end-user issues rather than proactive guidance on HA best practices
4. Difficulty Influencing Large-Scale Cloud Architectures:
   a. Inability to influence end-to-end, large-scale cloud architectures due to a focus on short-term fixes
This article, along with its accompanying simple demo, aims to bridge these knowledge gaps. It demonstrates how modern database features with high availability capabilities must be tightly integrated with the application layer. Only this integration achieves a reliable interplay between the application's business logic, its connection-handling approach, and advanced functionality such as intelligent transaction rerouting and effective caching.
The proposed demo emulates a simple cafe website with online ordering and a messaging form. To demonstrate that highly available solutions can be quite cost-efficient, I designed this web application as a "monolith" and wrote it in Python using the Flask microframework and the SQLAlchemy open-source SQL toolkit. To host the application code, I used Gunicorn, a Python WSGI HTTP server, behind the trusted and battle-tested Apache HTTPD server. The datastore relies on Amazon Aurora Global Database and Amazon DynamoDB Global Tables - two real keys to the availability castle.
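As a rough sketch of that stack, here is a minimal Flask route backed by a SQLAlchemy engine. This is illustrative only - the route, table, and column names are my own placeholders, not the demo's actual code - and an in-memory SQLite database stands in for Aurora:

```python
# Minimal sketch of the described stack: Flask for the web layer,
# SQLAlchemy for data access. All names here are illustrative.
from flask import Flask, request
import sqlalchemy as sa

app = Flask(__name__)

# In production this DSN would point at the Aurora cluster writer endpoint;
# SQLite keeps the sketch self-contained.
engine = sa.create_engine("sqlite:///:memory:")

metadata = sa.MetaData()
orders = sa.Table(
    "orders", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("name", sa.String(80)),
    sa.Column("people", sa.Integer),
)
metadata.create_all(engine)

@app.route("/order", methods=["POST"])
def place_order():
    # engine.begin() opens a transaction and commits it on success
    with engine.begin() as conn:
        conn.execute(orders.insert().values(
            name=request.form["name"],
            people=int(request.form["people"]),
        ))
    return "Order received", 201
```

In production, Gunicorn would serve this `app` object behind Apache HTTPD, as described above.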
I would like to emphasise the importance of considering already established industry solutions. Today's developers sometimes overlook the valuable lessons learned from 30 years of "technology legacy," which can lead to unexpected challenges. Not everything can (or should) be architected as microservices, and there are countless valid cases for simple web applications like the one described in this article. Even if you're not writing in Java EE, your simple Python web application can still benefit from established enterprise application patterns, perfected over decades by countless developers worldwide. Here's an approximate mapping of my simple website's components to the classic "Java Enterprise" stack.
This approach benefits from having a separate layer called the “Unified Data Platform” (UDP) that handles all data access for applications. This layer would act as a mediator between the application logic and various data sources (databases, caches, message queues).
The “Unified Data Platform” approach brings significant benefits:
1. Improved Maintainability: A segregated data access layer isolates the application logic from the specifics of each of many data sources. This makes the application code potentially highly available and easier to maintain and update, as changes to data sources wouldn't necessarily require downtime or changes to the application itself. For example, the underlying Aurora database can be patched, upgraded, or even completely rebuilt without causing application downtime or transaction loss.
2. Enhanced Security: By controlling all data access through a single layer, you can centralise security policies and access controls. This makes it easier to manage user permissions and prevent unauthorized access to sensitive data.
3. Flexibility and Scalability: A dedicated data access layer allows you to easily switch between different and multiple data sources without modifying the application logic. This provides more flexibility in terms of data storage and retrieval options and simplifies scaling of the data platform as needed.
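The mediator idea behind the three points above can be sketched in a few lines. This is a hypothetical illustration (the class and method names are mine, not the Rananeeti platform's), with stdlib SQLite standing in for Aurora; the application calls only this class and never touches a driver connection directly:

```python
# Hypothetical sketch of a "Unified Data Platform" mediator: application
# code talks only to this class, never to a specific datastore, so the
# backing store can change without touching application logic.
import sqlite3

class UnifiedDataPlatform:
    """Single entry point for all data access."""

    def __init__(self, dsn):
        self._conn = sqlite3.connect(dsn)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (name TEXT, people INTEGER)"
        )

    def save_order(self, name, people):
        # The application never sees SQL here; swapping SQLite for Aurora
        # (or adding a cache in front) only changes this layer.
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT INTO orders (name, people) VALUES (?, ?)",
                (name, people),
            )

    def count_orders(self):
        return self._conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

udp = UnifiedDataPlatform(":memory:")
udp.save_order("Test Name", 8)
print(udp.count_orders())  # → 1
```

Centralised security policies and reconnection logic would live inside the same class, invisible to callers.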
Here are some key characteristics of the proposed High Availability (HA) solution:
1. Data Platform Resiliency: Achieving high availability becomes easier and more cost-effective when "protection" mechanisms extend to the application layer. For example, let's consider the Cafe Website application built with standard Python web stack components:
2. On the Infrastructure side, we utilise trusted software:
These components can be mapped almost 1:1 to the industry "golden" standard of the full J2EE stack with proprietary web servers and Java application servers (JBoss, WebLogic, WebSphere).
3. Reconnection logic: My Cafe WebApp (via the Rananeeti UDP) uses pessimistic disconnect handling only to test the connection at checkout; the rest is done by the application code.
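SQLAlchemy ships this pessimistic approach out of the box: `pool_pre_ping=True` issues a lightweight "ping" each time a connection is checked out of the pool, and transparently replaces the connection if the ping fails. A minimal sketch, with SQLite standing in for Aurora:

```python
# Pessimistic disconnect handling in SQLAlchemy: pool_pre_ping=True tests
# each connection at checkout time, before the application uses it.
import sqlalchemy as sa

engine = sa.create_engine("sqlite://", pool_pre_ping=True)

# Each checkout below is preceded by an invisible ping; a stale
# connection would be discarded and replaced automatically.
with engine.connect() as conn:
    result = conn.execute(sa.text("SELECT 1")).scalar()
print(result)  # → 1
```

The ping covers connections that died while idle in the pool; failures that occur mid-transaction still have to be handled by the application code, as described next.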
The following screenshot showcases a simple test scenario, timestamped in this log file:
2024-06-15 08:54:54,694 INFO sqlalchemy.engine.Engine select pg_catalog.version()
2024-06-15 08:54:54,694 INFO sqlalchemy.engine.Engine [raw sql] {}
2024-06-15 08:54:54,698 INFO sqlalchemy.engine.Engine select current_schema()
2024-06-15 08:54:54,698 INFO sqlalchemy.engine.Engine [raw sql] {}
2024-06-15 08:54:54,701 INFO sqlalchemy.engine.Engine show standard_conforming_strings
2024-06-15 08:54:54,701 INFO sqlalchemy.engine.Engine [raw sql] {}
Connected. Press a key to start transaction
2024-06-15 08:55:27,875 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-06-15 08:55:27,875 INFO sqlalchemy.engine.Engine INSERT INTO cafe.ordersraw (ordrdgst, ordrname, ordrppl, ordrdttm, ordrtxt)
VALUES(%(p_ordrdgst)s, %(p_ordrname)s, %(p_ordrppl)s, %(p_ordrdttm)s, %(p_ordrtxt)s)
2024-06-15 08:55:27,875 INFO sqlalchemy.engine.Engine [generated in 0.00034s] {'p_ordrdgst': '19feb1b0db8f8a8759ad', 'p_ordrname': 'Test Name', 'p_ordrppl': 8, 'p_ordrdttm': '2024-01-25 14:15:06', 'p_ordrtxt': 'This is a test order'}
2024-06-15 08:55:27,875 INFO sqlalchemy.pool.impl.QueuePool Invalidate connection <connection object at 0x7f3881d7f040; dsn: 'user=cafeapp password=xxx dbname=cafedb host=cafe-1.cluster-c8tdsentijjf.us-west-1.rds.amazonaws.com', closed: 2> (reason: OperationalError:SSL SYSCALL error: EOF detected)
... got fatal DB error: SSL SYSCALL error: EOF detected
... waiting 5 sec and reconnecting new pool
2024-06-15 08:55:32,881 INFO sqlalchemy.pool.impl.QueuePool Pool disposed. Pool size: 5 Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2024-06-15 08:55:32,881 INFO sqlalchemy.pool.impl.QueuePool Pool recreating
... recreating new DB pool
2024-06-15 08:55:37,906 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-06-15 08:55:37,906 INFO sqlalchemy.engine.Engine INSERT INTO cafe.ordersraw (ordrdgst, ordrname, ordrppl, ordrdttm, ordrtxt)
VALUES(%(p_ordrdgst)s, %(p_ordrname)s, %(p_ordrppl)s, %(p_ordrdttm)s, %(p_ordrtxt)s)
=========> !!! note: this statement was cached since the first attempt <===========
2024-06-15 08:55:37,906 INFO sqlalchemy.engine.Engine [cached since 10.03s ago] {'p_ordrdgst': '19feb1b0db8f8a8759ad', 'p_ordrname': 'Test Name', 'p_ordrppl': 8, 'p_ordrdttm': '2024-01-25 14:15:06', 'p_ordrtxt': 'This is a test order'}
2024-06-15 08:55:37,907 INFO sqlalchemy.engine.Engine COMMIT
... Success! Transaction completed from 2nd attempt
... The End.
In this initial scenario, a simple Python script played the “App Code” role in the Rananeeti Data Platform, which detected an Aurora Multi-AZ failover. It then re-established a new connection pool (not individual connections) and successfully completed the transaction in real time, with a user wait time of less than 7 seconds. It follows that most applications that store data in a database should be “built around” a central data platform. Focusing early and excessively on the UI leads to situations where data handling, together with security, becomes an afterthought, causing expensive redesigns and architecture deficiencies.
A key takeaway from this test is that managing the connection pool at the application layer allows user transaction state to be "cached" and preserved between reconnects. This happens entirely at the application layer, avoiding more complex and expensive solutions such as Amazon RDS Proxy. Simply adhering to "PEP 249 – Python Database API Specification v2.0" may significantly improve your application's resilience in a cost-effective manner, while maintaining a simple and elegant stack.
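The retry pattern visible in the log above can be sketched as follows. This is my reconstruction, not the demo's actual code: on a fatal PEP 249 `OperationalError`, the whole pool is disposed, the code waits, and the same transaction is replayed against a freshly created pool. A file-backed SQLite database stands in for Aurora, and an injected exception simulates the "SSL SYSCALL error: EOF detected" failure:

```python
# Sketch of pool-level retry: dispose the pool on a fatal driver error,
# wait out the failover, then replay the transaction on a fresh pool.
import os
import tempfile
import time

import sqlalchemy as sa
from sqlalchemy.exc import OperationalError

db_path = os.path.join(tempfile.mkdtemp(), "cafe.db")
engine = sa.create_engine(f"sqlite:///{db_path}", pool_pre_ping=True)
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE orders (name TEXT)"))

failures = {"left": 1}  # fail exactly once, like the failover in the log

def flaky_insert(conn):
    if failures["left"]:
        failures["left"] -= 1
        # Simulated fatal connection error (stands in for the EOF case).
        raise OperationalError("INSERT", {}, Exception("simulated EOF detected"))
    conn.execute(sa.text("INSERT INTO orders (name) VALUES ('Test Name')"))

def run_with_retry(work, retries=1, wait=0.1):
    attempt = 0
    while True:
        attempt += 1
        try:
            with engine.begin() as conn:   # BEGIN ... COMMIT
                work(conn)
            return attempt
        except OperationalError:
            if attempt > retries:
                raise                      # out of retries: surface the error
            engine.dispose()               # invalidate the whole pool
            time.sleep(wait)               # let the failover complete

attempts = run_with_retry(flaky_insert)
print(attempts)  # → 2 (succeeded on the second attempt)
```

Because SQLAlchemy caches the compiled statement, the replay reuses it - which is the "[cached since ...]" line in the log.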
Now that the transactional layer has been established and validated, let's proceed with testing the Cafe application as a whole software system. The following image showcases the next test case: Cafe application resilience during an Aurora Database failover across Availability Zones. Our application successfully recreated the SQLAlchemy connection pool and completed the transaction using the database replica, which was just promoted to the primary writer instance of the Aurora cluster. This resulted in no impact on the user's web session and eliminated the need for them to manually repeat the transaction.
Database maintenance, such as instance patching or cluster upgrades, can sometimes cause delays. Fortunately, these are typically resolved using the same approach as above. By leveraging Aurora Global Database, our application can simply (re)create a connection pool using Aurora instances from another currently available region that isn't undergoing maintenance. However, what happens if our database becomes unavailable in both regions? In this scenario, the Rananeeti Data Platform's "backend code" will redirect transactions and store them in Amazon DynamoDB Global Table.
In this case there will be a delay in processing the user’s order, but still no need to manually repeat the transaction.
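The rerouting decision itself is simple to express. The sketch below is hypothetical (a plain dict stands in for the DynamoDB Global Table, and the function names are mine); in the real platform the fallback branch would write the item to DynamoDB via boto3 instead:

```python
# Sketch of the "redirect on total outage" path: try the relational store
# first, and on failure park the order in a key-value fallback store.
class TotalOutage(Exception):
    """Raised when the database is unreachable in every region."""

fallback_store = {}  # stand-in for a DynamoDB Global Table

def write_primary(order):
    # Simulates the scenario where Aurora is down in both regions.
    raise TotalOutage("database unavailable in both regions")

def place_order(order):
    try:
        write_primary(order)
        return "committed"
    except TotalOutage:
        # Key by the order digest so later replay is idempotent.
        fallback_store[order["digest"]] = order
        return "queued"

status = place_order({"digest": "19feb1b0", "name": "Test Name", "people": 8})
print(status)  # → queued
```

From the user's point of view the order is accepted either way; only the processing latency differs.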
Once the database becomes available again in either region, the Rananeeti "Transactions Queue Code" will automatically detect this and process all cached transactions.
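A drain loop of that kind can be sketched as follows - again a hypothetical reconstruction, with a dict standing in for the DynamoDB cache and stdlib SQLite for the recovered database. The key detail is that an entry leaves the cache only after its database commit succeeds:

```python
# Sketch of a "Transactions Queue" drain: replay every cached order into
# the recovered database, removing each entry only after a successful commit.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (digest TEXT PRIMARY KEY, name TEXT)")

cached = {  # stand-in for the DynamoDB Global Table; entries are illustrative
    "19feb1b0": {"name": "Test Name"},
    "a3c9e771": {"name": "Second Order"},
}

def drain_queue(db, cache):
    replayed = 0
    for digest in list(cache):            # copy keys; we mutate the dict
        order = cache[digest]
        with db:                          # one commit per order
            db.execute(
                "INSERT OR IGNORE INTO orders (digest, name) VALUES (?, ?)",
                (digest, order["name"]),
            )
        del cache[digest]                 # remove only after the commit
        replayed += 1
    return replayed

print(drain_queue(db, cached))  # → 2
```

`INSERT OR IGNORE` (or `ON CONFLICT DO NOTHING` in PostgreSQL) makes the replay safe to repeat if the daemon crashes mid-drain.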
The only remaining question is how to notify the user that their order has been successfully placed, even though there may be a delay. We'll demonstrate this during our final two full end-to-end tests.
Full MultiAZ Failover test
As the name implies, we repeat the “database instance patching” or a similar outage scenario and observe how our application reacts. We expect that user orders will still be completed in real time, just with a short delay on the order webpage and no rerouting.
The full Multi-AZ failover test successfully completed in under 9 seconds, as expected. In previous tests, we've even observed completion times as low as 6 seconds. This demonstrates the application's ability to efficiently handle database downtime caused by failovers during patching.
Final Test: Full DB Outage
The final scenario to check is a full database outage. For this use case I captured the full application screen and the relevant parts of the backend log.
The logs indicate that the Rananeeti "Transactions Queue Code" removes completed orders from the transaction cache after processing. It's important to run the "Transactions Queue Code" as a systemd-controlled daemon. This approach provides finer control over its resource consumption through cgroups and other related Linux OS features.
And here's a final benefit to consider: the Rananeeti Data Platform is designed to ensure that each transaction is entered only once, even during failovers and outages. This eliminates the risk of duplicated order entries. Additionally, if a user attempts to re-submit the same order, our application will intelligently identify this scenario and prevent double bookings (preserving database integrity).
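One common way to get this once-only guarantee - consistent with the `ordrdgst` column visible in the log, though the exact implementation here is my own assumption - is to derive a digest from the order's content and make it the primary key, so a re-submitted identical order becomes a no-op:

```python
# Sketch of content-digest deduplication: identical orders hash to the same
# key, so a repeated insert is silently ignored. SQLite's INSERT OR IGNORE
# stands in for PostgreSQL's INSERT ... ON CONFLICT DO NOTHING.
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (digest TEXT PRIMARY KEY, payload TEXT)")

def order_digest(name, people, when, text):
    raw = f"{name}|{people}|{when}|{text}".encode()
    return hashlib.sha256(raw).hexdigest()[:20]

def save_once(name, people, when, text):
    digest = order_digest(name, people, when, text)
    with db:
        cur = db.execute(
            "INSERT OR IGNORE INTO orders (digest, payload) VALUES (?, ?)",
            (digest, text),
        )
    return cur.rowcount == 1  # True only when a new row was written

first = save_once("Test Name", 8, "2024-01-25 14:15:06", "This is a test order")
dup = save_once("Test Name", 8, "2024-01-25 14:15:06", "This is a test order")
print(first, dup)  # → True False
```

The same digest doubles as the key for the DynamoDB fallback store, which is what keeps replays after an outage from creating duplicates.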
The base for these benefits has been established by a critical architecture decision: building our Unified Data Platform first. That core platform then served as a solid foundation upon which we could develop and integrate other application functionalities.
By adopting a proactive approach to high availability and seamlessly integrating modern database features into the application's design and infrastructure, we empower developers to craft applications that are not only functionally robust but also remarkably resilient. This translates to a significantly enhanced user experience with minimal disruptions, making long-term application success possible.
Are you interested in improving the uptime and reliability of your business systems?
I may soon be offering a workshop on application resiliency that could help you achieve just that. In this workshop, we'll explore strategies and techniques to ensure your applications remain available even during outages or failures.
While I won't be directly sharing my Cafe Application's source code, I may leverage the principles behind its resilient architecture to tailor the workshop to your specific needs. We may attempt to identify potential vulnerabilities in your systems and develop a plan for enhancing their availability.
Benefits of attending:
Feel free to reach out if you have any questions or are interested in learning more about this workshop.
References:
1. Building a Cloud Fortress