Got a new MVP web project coming up? What you really need is a semi-relational semi-normalized semi-structured document-store database!

In almost every web application for the past 3 decades or more, developers have gone through the following (or similar) thought process:

  • Which relational database should I use, and how will I manage my schema changes, migrations, etc.?
  • Which ORM should I use to make my life easier?
  • Where will I host my database?
  • Do I use an on-prem database server? A cloud provider? Serverless?
  • Do I need middleware to connect my application to my DB?
  • How will I integrate my ORM and schema changes into my CI/CD?
  • How will I replicate my database into development/staging/test environments?
  • How will I handle backups?
  • Etc.

After answering these questions, the next predictable step for most developers will be to build some sort of CRUD scaffolding on top of the ORM within their application development framework.

Inevitably as their app starts to rapidly grow, so do the number of schema changes. As fields get added and dropped, more data needs to be migrated, more seeds need to be created, more CRUD endpoints need to be updated, etc, etc.

This is all too much. More often than not, we want to be able to launch a minimum viable product fast and validate/fail just as fast. We should probably not be spending more than 2 minutes of thought on what, how, and where we will store our data in 2024. Not because we don't care, but because it's been way too many decades and this crap should be a lot easier and more straightforward by now.

Tangent Warning

And no, I will not entertain your 'cloud-based-serverless-no-code-ai-driven-buzztastic' database startup (hint: vendor lock-in and de-platforming risks). We need a quick, repeatable, framework-independent, and open-source way of connecting an app/web app to a database without all of the bullshit. We want to be able to quickly iterate on our apps and make schema changes across team members without clunky migrations, and for the love of god… please… we do not want to use ORMs. Dammit!

End Tangent


So what I will try to do in today's post is demonstrate that for most small(ish) apps, we can make our lives significantly easier by simply using an unstructured+relational paradigm on top of SQLite. And, to stick with the trend-du-jour, I shall also point out that obviously there is nothing more serverless than SQLite.

I’ve said too much…more on SQLite later.

Why not MongoDB and Document Stores

At first you might read my introduction and say, "Well, you just described a document store." This is partially true. Document stores like MongoDB definitely provide the flexibility of not needing schema migrations on top of unstructured data. However, Mongo and similar projects often lack features like joins, create lots of duplication, and tend to consume large amounts of RAM. Many of them (Mongo included) do not natively support SQL, and they require running a server, cluster, or cloud offering.

What do we actually want

By ‘we’ I’m referring to my personal experience of building countless web applications over the last 20+ years with very diverse teams and methodologies. Your requirements may differ slightly but hopefully these ‘wants’ are general enough for most web-app use cases:

So, at a high level we want:

  1. The ability to work in our language of choice: SQL
  2. The ability to work quickly and easily with teams without migration headaches
  3. The ability to have semi-normalized data (high-level relationships between tables)
  4. The ability to JOIN tables in queries like relational databases give us
  5. The ability to have a mix of structured/unstructured data
  6. The ability to add indexes and optimize queries
  7. The ability to segment data and provide isolation when needed
  8. The ability to quickly snapshot, backup, and restore our database
  9. Straightforward integration with our CI/CD processes
  10. Reasonably fast query responses

So obviously we want a semi-relational semi-normalized semi-structured document-store database

In all seriousness though, many databases on the market can achieve most of these requirements, but few can check all the boxes. There is, however, one O.G. database that has quietly become ubiquitous in our lives, is a core component of technologies like Android, Google Chrome, Firefox, and Airbus airplanes, and has the ability to check all of the boxes: SQLite.

The Core Architecture Decisions

So how will we construct this magical developer friendly database experience? Let’s begin by defining a few core decisions that we want to stick with:

Decision 1 - As mentioned, we will use SQLite to accommodate a modern application architecture.

  • Serverless - SQLite doesn't run as a 'server'; it is simply a file, like your Excel sheet or Word doc. This means the database is portable and can run anywhere.
  • Data Isolation - You can segment your data by customer, project, etc. simply by creating a new SQLite file
  • Edge Friendly - You can store read-only copies of your SQLite db on the edge, in the user's browser, etc. Tools like https://litestream.io automate continuous replication to S3-compatible object stores.
  • CI/CD Friendly - Copy your SQLite file and you have an immediate restore point, staging copy, etc.
  • JSON - There is rich JSON support.
  • Built-in Snapshotting and Backups - Both the Online Backup API and VACUUM INTO features of SQLite provide fantastic out-of-the-box capabilities (see the one-liner after this list).
  • Support - SQLite has been around for about 24 years. To say there is a lot of tooling and support out there is an understatement.
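
For example, VACUUM INTO gives you a consistent, compacted snapshot of a live database in a single statement (the target filename here is just an example):

-- Write a consistent, defragmented copy of the open database to a new file
VACUUM INTO 'movies-backup.db';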

Decision 2 - ORMs are lame, so we won’t use ORMs. Here’s why:

  • ORMs often build inefficient queries
  • ORMs often introduce their own rules and table naming conventions
  • ORMs often get used just to ease the pain of schema migrations, which themselves cause as many issues as they solve
  • ORMs often introduce their own query language because we think this solves SQL injection…

Decision 3 - Schema Migrations Suck (most of the time), so we want something better:

  • Adding a new table or a simple column to our data model should not require scheduling a schema migration in a rapid development / MVP environment.
  • In an ideal world, schema changes should never modify existing data or accidentally drop entire columns of data.
  • Schema migrations can slow down teams who are all working on the same application and require coordinated schema changes.
  • Rolling back schema changes is not always possible without data loss.

Decision 4 - Entity Relationships should be allowed to exist but not at the database level.

  • Foreign Keys mean schema maintenance
  • Joins are important and developers know how to use them but we need a better way than constantly adding foreign keys and join tables.

Decision 5 - Keep it Simple

  • There are endless discussions on database design and architecture and certainly the CQRS + Event Sourcing patterns are inspirational but we want a quicker, repeatable way to deploy basic web apps.

Disclaimer - One size does not fit all.

  • Scalability - Most web apps, enterprise or hobbyist, are not used by thousands of users simultaneously. If your situation calls for HUGE data and TONS of simultaneous users from day 1, this may not be the approach for you. That being said, the SQLite website (https://www.sqlite.org/) runs on SQLite itself; it handles about 400K to 500K HTTP requests per day, about 15-20% of which are dynamic pages touching the database. Scaling reads/writes in SQLite is a bit outside the scope of this article, but it is certainly possible with the right architecture.
  • Security - If someone convinced you row-level security is a good idea and you absolutely need it, or likewise you need your DBMS to be your authentication/authorization system or to consume/issue JWTs or handle other non-DBMS security concerns, then, again, this will not be a good solution for you.
  • Stored Procedures - Since SQLite is just a file, there is no running process to handle all aspects of traditional DBMS stored procedures. That being said, there is basic trigger support, and you could write similar features/functions directly in your application if needed.
  • Unstructured data still needs maintenance - Extremely large applications or applications that have long term mutations of table/column structure can become chaotic and messy if left in an unstructured state. Just because we’re building on top of unstructured data doesn’t mean you shouldn’t maintain a clean house.

Building a Document Store Like Experience on Top of SQLite

As it turns out, we actually need to do very little to accomplish our goals with SQLite! In fact, to achieve our goals of being semi-relational, semi-structured, and semi-normalized, we start by adopting an extremely basic database schema, and from there on out we rely on the built-in SQL functions offered by SQLite.

Our simple, no-fuss schema:

From a database schema perspective, we need to create a table like we would in any relational model with one exception - our tables will only need two fields: ID and data

In other words, each table is simply a set of rows that:

  • Have a single column for data, called data, which is a BLOB field that stores our data in JSON format (one JSON document per row)
  • Have a primary key called ID used for referencing the document
  • Do NOT have foreign keys but still support JOINs (more on that later)
  • Can have indexes built using JSON extraction on top of existing keys in our JSON data

Here's a representation of what a database using our new SQLite-powered document-store methodology might look like:
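
A minimal sketch in DDL form (these are the same example tables this post builds below; the JSONB CHECK constraint comes later):

-- Every 'collection' is just a two-column table
CREATE TABLE movies (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    data BLOB   -- one JSON document per row
);

CREATE TABLE actors (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    data BLOB   -- one JSON document per row
);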


Pretty simple, right? What we're mimicking here is the NoSQL concept of 'collections' versus tables, but don't worry, we're not going to lose too much relational functionality by doing this!

Let’s see how this plays out!

An overview of JSON functions in SQLite

Let’s review some basics of working with JSON in SQLite.

First, you will notice that I chose the BLOB type when defining the data field in our schema rather than just TEXT. This is because of a fantastic recent addition to SQLite called JSONB.

To quote the SQLite website directly:

Beginning with version 3.45.0, SQLite supports an alternative binary encoding of JSON which we call "JSONB". JSONB is a binary format that is stored as a BLOB.

In other words, we can store regular string-based JSON documents in a more optimized binary encoding in our data BLOB fields, and SQLite will handle the entries just like regular text-based JSON, which makes working with JSON data much faster.
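
A quick way to see the round trip in the CLI, as a minimal sketch: jsonb() converts JSON text into the binary encoding, and json() renders it back as text.

-- jsonb() returns a BLOB; json() converts it back to readable text
SELECT json(jsonb('{"Name": "Test", "Year": 1970}'));
-- result: {"Name":"Test","Year":1970}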

Additionally, like many modern databases, SQLite ships with a set of over 30 scalar, aggregate, and table-valued functions that make working with JSON truly pleasant and simple.

If you'd like to follow along and try the following examples yourself, you will need the SQLite CLI ≥ v3.45.0 installed. SQLite comes packaged with some operating systems such as macOS, but you may need to install it separately.


Let's fire up a new SQLite db:

sqlite3 movies.db        

This will open the SQLite command prompt, where we can execute SQL commands.


Example of creating a table:

CREATE TABLE movies (
    -- The primary key
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    -- A BLOB that must be JSONB
    data BLOB CHECK(json_valid(data, 0x08))
);        

Confirm the table exists:

SELECT name FROM sqlite_master WHERE type='table';        

Example of inserting data:

-- Insert our JSON document into the data column in the movies table
INSERT INTO movies (data) VALUES
(jsonb('{"Name":"Forgotten in the Planet","Year":1970,"Genre":["Comedy","Crime"],"Director":"Henrie Randell Githens","Cast":["Adrian Gratianna","Tani OHara","Tessie Delisle"],"Runtime":90,"Rate":7.0}'));
        

Now let's select that data:

-- Get all Name values from the data JSON documents in table movies
SELECT data->'Name' FROM movies;
-- or using a JSON path expression
SELECT data->'$.Genre[0]' FROM movies;

Now let's update our movie:

UPDATE movies SET data = jsonb_replace(data, '$.Name', 'My Amazing Movie') WHERE data->>'$.Name' = 'Forgotten in the Planet';

Notice the shorthand operator ->> in the WHERE clause differs from the -> we used in our SELECT. The -> operator returns a JSON representation of the value (so strings keep their quotes), whereas ->> returns the underlying SQL value (TEXT, INTEGER, REAL, or NULL). That's why ->> is usually what you want when comparing against plain SQL literals in a WHERE clause.
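
You can see the difference side by side against the row we just updated (a quick sketch; output formatting depends on your CLI mode):

-- -> keeps the JSON quotes, ->> strips them to a plain SQL value
SELECT data->'$.Name', data->>'$.Name' FROM movies;
-- result: "My Amazing Movie" | My Amazing Movie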


Handling JOINS and table relationships without foreign keys

Previously I stated that we will not be adding any foreign keys to our tables. Relational DB purists have probably scoffed at least several times by now, but remember, for our methodology we want to avoid doing anything directly in the database that would require working with migrations. Foreign keys, by design, create strict constraints that over time make migrations more difficult. That's not to dismiss the many benefits they provide; we're just trying to find a simpler way of doing things at the cost of giving up a few features.

We can, of course, still JOIN tables using SQL without foreign keys. Let's take a look at the following example of joining two regular relational tables, without foreign keys, using a WITH statement:

CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT
);

CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    order_date DATE
);

-- Query using a WITH statement to express the relationship
WITH customer_orders AS (
    SELECT customers.name, orders.order_date
    FROM customers
    JOIN orders ON customers.customer_id = orders.customer_id
)
SELECT * FROM customer_orders;        

This is no different in our document-store model. We can use JSON functions to accomplish joins using fields inside our data JSON documents!

Let's create a table called actors whose JSON documents have a field called movies that defines a relationship between actors and movies:

Create Actors

CREATE TABLE actors (
    -- The primary key
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    -- A BLOB that must be JSONB
    data BLOB CHECK(json_valid(data, 0x08))
);        

Let's add some actor data. Notice the movies key contains actor-movie relationships:

INSERT INTO actors (data) values
(jsonb('{"Name":"Jane Doe", "movies":[1]}'));
INSERT INTO actors (data) values
(jsonb('{"Name":"John Smith", "movies":[1,2]}'));        

Now let’s join it all together!

-- First create our actor_movies relationship 'table' by splitting
-- on the movies array in each actor's JSON document
WITH actor_movies AS (
    SELECT
        a.ID,
        a.data,
        json_each.value AS movie_id
    FROM
        actors AS a,
        json_each(a.data->>'movies')
)
-- Now do a normal join.
SELECT
    am.data->>'$.Name' as actor_name,
    m.data->>'$.Name' AS movie_name
FROM
    actor_movies AS am
JOIN
    movies AS m
ON
    am.movie_id = m.ID;
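
With the sample data above (a single movie with ID 1, renamed earlier to 'My Amazing Movie'), the output should look like the following; movie 2 was never inserted, so John Smith's second entry simply finds no match:

Jane Doe|My Amazing Movie
John Smith|My Amazing Movie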
        

Notice what we've done here: we've taken unstructured data and built relational dependencies on top of the JSON documents themselves, and, because of SQLite's fantastic JSON features, we can still treat the unstructured JSON as table 'columns'. This isn't anything groundbreaking or SQLite-specific, many databases have similar JSON functions, but the important thing to notice is how we're choosing to maintain a relational model on top of a single JSON field.


Let's say you need to add a new JSON-style foreign key to your actors documents in the future. You can easily do this with a query that looks like this:

UPDATE actors
SET data = jsonb_set(data, '$.shows', json('[1]'));        
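
And removing that key later is just as non-disruptive. A minimal sketch using jsonb_remove, the JSONB counterpart of json_remove:

-- Drop the shows key from every actor document
UPDATE actors
SET data = jsonb_remove(data, '$.shows');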

Enforcing Referential Integrity with Triggers

Aside from easy joins, foreign keys also typically provide us with referential integrity. While this can be enforced at the application layer, sometimes we still want some basic referential integrity to happen inside of our database and with SQLite triggers, we can accomplish this without having to touch our table schemas.

Having integrity rules split from table definitions feels more natural and allows our rules to be easily adjusted/modified as our application grows. However, I still maintain that the majority of this type of logic should live outside of your DBMS. Even though triggers don't require schema migrations, you will need to introduce a more rigid deployment process when adding/removing triggers, and with too many triggers, things can become messy.

That being said, let's use our ongoing example of actors and movies: we want to ensure that no one inserts a JSON document containing values in the movies key that do not exist as actual movie IDs in the movies table. In other words, don't allow actors to be associated with movies that don't exist. One caveat: a SQLite trigger fires on a single event only (there is no BEFORE INSERT OR UPDATE in SQLite), so we need a pair of triggers, one for inserts and one for updates. Here's how we can accomplish this:

CREATE TRIGGER validate_actor_movies_insert
BEFORE INSERT ON actors
FOR EACH ROW
BEGIN
    -- Raise an error if any movie ID in the actor's data
    -- does not exist in the movies table
    SELECT CASE
        WHEN EXISTS (
            SELECT 1
            FROM json_each(NEW.data, '$.movies')
            WHERE value NOT IN (SELECT ID FROM movies)
        ) THEN
            RAISE(ABORT, 'One or more movie IDs in actor.data->"$.movies" do not exist in the movies table')
    END;
END;

-- The UPDATE twin is identical apart from its name and event
CREATE TRIGGER validate_actor_movies_update
BEFORE UPDATE ON actors
FOR EACH ROW
BEGIN
    SELECT CASE
        WHEN EXISTS (
            SELECT 1
            FROM json_each(NEW.data, '$.movies')
            WHERE value NOT IN (SELECT ID FROM movies)
        ) THEN
            RAISE(ABORT, 'One or more movie IDs in actor.data->"$.movies" do not exist in the movies table')
    END;
END;
        

Now if we try to insert (or update) an actor associated with a movie that does not exist, SQLite will throw an error!
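
A quick sketch of the failure case, assuming no movie with ID 999 exists:

INSERT INTO actors (data) VALUES
(jsonb('{"Name":"Imaginary Actor","movies":[999]}'));
-- Runtime error: One or more movie IDs in actor.data->"$.movies" do not exist in the movies table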

The same goes for cascading deletes

Just like with our triggers above, where we enforce referential integrity on inserts and updates, we can similarly use triggers to cascade deletes!

Let's say we want to remove all actors who were associated with a particular movie when that movie is deleted:

CREATE TRIGGER cascade_delete_actors
AFTER DELETE ON movies
FOR EACH ROW
BEGIN
    DELETE FROM actors
    WHERE ID IN (
        SELECT a.ID
        FROM actors AS a,
        json_each(json_extract(a.data, '$.movies')) AS je
        WHERE je.value = OLD.ID
    );
END;
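
A quick usage sketch: with the sample data from this post, deleting movie 1 also removes both actors, since both reference it in their movies arrays.

DELETE FROM movies WHERE ID = 1;
SELECT COUNT(*) FROM actors;
-- result: 0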
        

If you want to see all the triggers you’ve created to date for a particular table you can run:

SELECT name
FROM sqlite_master
WHERE type = 'trigger' AND tbl_name = 'movies';

-- or to see all triggers:
SELECT name
FROM sqlite_master
WHERE type = 'trigger';
        

Don’t want a trigger anymore - simply delete it:

DROP TRIGGER IF EXISTS cascade_delete_actors;        

Indexing JSON fields for faster queries.

With traditional database tables we can create indexes to speed up our queries. The nice thing about SQLite is that you can still build these indexes on JSON fields! Let's assume we have tons and tons of movies and we want to build an index on the Year field:

CREATE INDEX movie_year_idx ON movies(data->>'Year');        

Now let's select something from the movies table and see if our index is being used. Yep, sure enough, when we filter on the JSON key Year, our SELECT queries are now optimized by the index!

-- Note: compare against the number 1970, since Year is stored as a JSON number
EXPLAIN QUERY PLAN SELECT * FROM movies WHERE data->>'Year' = 1970;

Results:
id	parent	notused	detail
3	0	0	SEARCH movies USING INDEX movie_year_idx (<expr>=?)
        

Need to kill this index at a later point in time?

DROP INDEX IF EXISTS movie_year_idx;

Conclusion: Yes, I realize this sounds absurd.

It is not lost on me that what I'm describing is essentially the reconstruction of built-in relational database features on top of a non-relational and unstructured database schema. In fact, that is kind of the point. The entire premise of this thought experiment was to find a way to develop an MVP quickly using the benefits of NoSQL document stores while still keeping some of the great features a DBMS offers us. In the normal relational world, adding or changing foreign keys can have significant impacts on the availability and integrity of data, but abstracting some of those basic features using the techniques described here allows us to add and remove them without disrupting the underlying data or schema. It also means that we can create multiple versioned copies of our relationships and allow them to co-exist if needed. Most importantly, our developers can quickly add new columns/fields to tables without needing to create significant migrations.

There are of course also obvious downsides, and as I pointed out, this was mostly a thought experiment. As with any process, it's important to set some ground rules that your team agrees on when managing your data storage system, and in this case, those rules will operate much more on good faith than through programmatic enforcement. YMMV.

Part 2: Implementing this approach in code

In my next installment I will attempt to show some of the benefits of using the proposed approach in a real-world web app and hopefully demonstrate that it's probably OK to throw that ORM you love so much to the curb :) . But that's for another day.
