Why not ORMs?
It's no news that opinions in tech often diverge over performance concerns with certain languages or tools. Just as some developers claim that certain languages are slow, similar debates occur about databases. I've heard it numerous times: "Using ORMs is slow." In the Go ecosystem, for instance, some developers opt for SQLC over GORM for this reason.
Everyone has their reasons, but I've come to realize that the perceived slowness of ORMs is often due to how we use them. Sometimes we neglect to dig deep into making certain types of queries efficient; other times we aren't even aware that a particular query could be optimized. And most times, "the deadline is the next day lol".
In Django, for instance, developers often shy away from helpers like prefetch_related and select_related. While these have their own advantages and disadvantages, I've realized that the ORM itself is not inherently slow; it's about how we use it. The same goes for frameworks, and perhaps languages, in general.
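To make the select_related point concrete, here's a minimal sketch using plain sqlite3 from the standard library rather than Django (the table and column names are hypothetical, made up for illustration). Lazily following a foreign key costs one query per row, while a join fetches everything in a single query, which is essentially what select_related does under the hood.

```python
import sqlite3

# Hypothetical schema: projects with a foreign key to their owner.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owner (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE project (id INTEGER PRIMARY KEY, owner_id INTEGER);
    INSERT INTO owner VALUES (1, 'a@x.com'), (2, 'b@x.com');
    INSERT INTO project VALUES (10, 1), (11, 2), (12, 1);
""")

statements = []
conn.set_trace_callback(statements.append)  # record every SQL statement issued

# Naive pattern (what a lazy FK access does): one query for the projects,
# then one more query per project to fetch its owner.
projects = conn.execute("SELECT id, owner_id FROM project").fetchall()
for _, owner_id in projects:
    conn.execute("SELECT email FROM owner WHERE id = ?", (owner_id,)).fetchone()
print(len(statements))  # 4 statements: 1 + N, with N = 3 projects

# select_related pattern: a single JOIN brings the owners along.
statements.clear()
rows = conn.execute("""
    SELECT project.id, owner.email
    FROM project JOIN owner ON owner.id = project.owner_id
""").fetchall()
print(len(statements))  # 1 statement for the same data
```

With three rows the difference is trivial, but the naive pattern scales linearly with the number of rows while the join stays at one round trip.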
Here's a quick example to illustrate this point. Recently, I worked with Adefemi Oseni, who identified an expensive operation and helped me understand how to optimize ORM queries properly.
Let's say we want to check the status of users running an agent (machine learning model) and identify those who have surpassed their usage time so we can clear or stop them. Here's an initial approach:
from django.utils import timezone
from django.db import models
from datetime import timedelta
import enum

class StatusChoices(enum.Enum):
    RUNNING = 'running'
    STOPPED = 'stopped'
    CREATED = 'created'

def check_running_servers():
    # Get running projects
    running_servers = RunningProject.objects.filter(status=StatusChoices.RUNNING)

    for running_server in running_servers:
        # Get the owner's timeout value (an integer number of seconds)
        time_duration = timedelta(
            seconds=running_server.project.owner.test_server_timeout
        )
        timeout_time = running_server.created_at + time_duration

        # Check if it has surpassed the current time
        if timeout_time < timezone.now():
            stop_test_server(running_server.id)
            send_email(
                "Agent Translation Task Timeout",
                f"Your test server for project {running_server.project.name} has timed out",
                running_server.project.owner.email,
            )
Now, let's consider a scenario where we have 1 million records. Are we going to fetch the owner and compute the timeout for each project, one query at a time, every time we call the check_running_servers function? What if multiple checks are going on at the same time? Whew! We don't want that. Yet this code looks normal; we wouldn't even suspect it has any deficiency. This is one of the pitfalls we often fall into.
However, most ORMs provide measures to handle scenarios like this efficiently. Let's refactor this code using Django's annotate to perform the necessary calculations in the database query itself:
from datetime import timedelta

from django.db.models import F, ExpressionWrapper, DurationField, DateTimeField

def check_running_servers():
    running_servers = RunningProject.objects.filter(
        status=StatusChoices.RUNNING
    ).annotate(
        # Convert the owner's integer timeout (seconds) into a duration
        timeout_duration=ExpressionWrapper(
            F('project__owner__test_server_timeout') * timedelta(seconds=1),
            output_field=DurationField()
        ),
        # Compute the cutoff time inside the database
        timeout_time=ExpressionWrapper(
            F('created_at') + F('timeout_duration'),
            output_field=DateTimeField()
        )
    ).filter(timeout_time__lt=timezone.now())

    for running_server in running_servers:
        stop_test_server(running_server.id)
        send_email(
            "Test Server Timeout",
            f"Your test server for project {running_server.project.name} has timed out",
            running_server.project.owner.email,
        )
Wow! Using an ORM-provided helper and a few other Django classes, we perform all the calculations within the database, reducing the overhead on our application server and making the process more efficient. (One remaining caveat: accessing running_server.project.name and running_server.project.owner.email inside the loop can still trigger extra queries per row; chaining .select_related('project__owner') onto the queryset fetches those relations in the same query.)
Let's dig deeper. What makes this code more efficient?
The refactored code is more efficient because it reduces the number of database round trips and avoids fetching unnecessary data. In the original implementation, we fetch all RunningProject instances, then iterate over them to retrieve each project.owner.test_server_timeout value and calculate the timeout_time in Python. This results in multiple database queries (the N+1 problem) and potentially fetches far more data than necessary. In the refactored version, the calculations are performed within the database query itself, leveraging the power of the ORM and the database engine.
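To see what "calculations performed within the database" looks like outside Django, here's a small sketch using the standard library's sqlite3 (the schema is hypothetical, a simplified stand-in for the article's models). The timeout arithmetic and the comparison both happen in SQL, so only the rows that have actually timed out ever leave the database:

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema mirroring the example: each project row stores its
# creation time and its owner's timeout in seconds.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE project (
        id INTEGER PRIMARY KEY,
        created_at TEXT,
        timeout_seconds INTEGER,
        status TEXT
    )
""")

now = datetime(2024, 1, 1, 12, 0, 0)
rows = [
    # created 2 hours ago with a 1-hour timeout -> already timed out
    (1, (now - timedelta(hours=2)).isoformat(" "), 3600, "running"),
    # created 10 minutes ago with a 1-hour timeout -> still fine
    (2, (now - timedelta(minutes=10)).isoformat(" "), 3600, "running"),
]
conn.executemany("INSERT INTO project VALUES (?, ?, ?, ?)", rows)

# The timeout arithmetic happens inside the query, just like annotate():
# created_at + timeout_seconds is computed per row by the database, and
# only rows past their cutoff are returned.
timed_out = conn.execute(
    """
    SELECT id FROM project
    WHERE status = 'running'
      AND datetime(created_at, '+' || timeout_seconds || ' seconds') < ?
    """,
    (now.isoformat(" "),),
).fetchall()

print(timed_out)  # [(1,)] -- only the expired project comes back
```

Whether it's Django's annotate() or raw SQL, the principle is the same: push per-row arithmetic and filtering down to the database instead of looping over rows in application code.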
To an extent, and depending on the application you're working on, ORMs are powerful tools that can greatly simplify database operations and improve developer productivity. To get the efficiency you want, it's essential to understand their underlying mechanisms, learn the available optimization techniques, and continuously monitor and optimize your queries as your application grows. These days, chat models (ChatGPT, Claude, etc.) can provide insights and make this kind of research fast; use them as a guide or manual when a Google search isn't helping. The key is to explore and understand the tools at our disposal, ensuring we use them to their full potential. This way, we can avoid common pitfalls and create efficient, scalable applications.