# Hybrid Data Stores

If you have been reading this series in order, you likely have thought about applications you may have worked with in the past, and considered how a document structure may look. Perhaps there were many thoughts like "Oh yeah, that makes sense," or "Oh, that would be way easier than...." _(If so, then this series of articles has had its intended effect - so far.)_

Conversely, there may have been other thoughts - things like "We can't do that because..." or "We really need ACID guarantees for \[scenario\]." While some information may work as documents, others cannot. This limiting factor means that the data store will stay with a relational database, in spite of the extra work it takes to maintain a rigid structure throughout the application. Besides, we're already used to it!

If both paragraphs above capture your thoughts, you are in the target market for a hybrid data store. Rather that jettison one data structure and switch to another, we can incorporate the good parts from both in a single data store.

## Relational Data Structures

### Can Be Flexible...

There is a reason that relational data stores are the default representation for data at rest throughout the industry, both professional and hobbyist. Despite its depictions as rigid and structured, it is incredibly flexible as to the rigid structure one can define - and nullable columns and foreign keys provide the ability to model that a relationship may not exist.

Content management systems (CMSs) are the textbook example of how one can build a flexible application using relational tables. Applications such as WordPress and SharePoint are commonplace - you likely know both without any further research - yet both are built on relational databases (MySQL and SQL Server, respectively). As of this writing, WordPress serves just over 50% of the Internet, and SharePoint is the _de facto_ standard in enterprises, serving countless sites both on the public Internet and in private networks (and serves as the backend store for Microsoft Teams' file sharing).

### ...But Can Be Complex

At one point in his career, this author joked about making a table called `the_table` with two columns (`id` - 32-character string - and `the_field`, an unbounded text field), which would have been the ultimate flexible schema for the application in question (which, at the time, used a data model that was neither relational nor document-oriented). This joke was taken as such (and appreciated) because it pointed out that, for all the complains about how inflexible our current data structure was, it _could_ be replaced with something even worse. Even the human body would be reduced to a puddle of goo were it not for the skeletal structure on which our organs rely.

_As segues go - is there any entity more complex than the human body? Thankfully, database structures are well above the level of molecular biology or physiology!_

The two examples above, though, show different approaches to the flexible-relational paradigm. The [WordPress schema][wp-schema] is not huge - 12 tables as of this writing - yet all the content for some major media sites is contained (mostly; plugins can create tables for their own use) in those tables. In the `wp_posts` table, there are:

- Blog posts (initial post, maybe updates, comments, categorizations, etc.)
- Static pages (timeless content that may change infrequently; similar to this page)
- Any other custom content item

The key to this is the `post_type` column, a free-form 20-character field used to indicate what sort of content this "post" represents. But, because of this flexibility, the schema supports data structures that are not valid in the application. Blog posts can have categories (broad topics to which the post applies) and tags (distinct points mentioned in the particular post), and both are stored in the `wp_term_relationships` table. Static pages, also stored in `wp_posts`, have neither of these applied to them, yet there is nothing in the data model that prevents lots of `wp_term_relationships` rows for these as well (which may be valid, if a plugin adds some other way of categorizing pages).

> [!NOTE]
> The purpose of the above is not to rip on WordPress's schema. What that project has implemented with just a few tables, and the way they ensure that major and minor upgrade version are supported, is nigh-heroic. This author's biggest quibble with the WordPress schema has to do with using plural nouns for their tables; it's the "post" table, not the "posts" table, darn it....

If, from a data structure perspective, WordPress is complex - SharePoint blows it out of the water. It uses _so many tables_ - dynamically creating them in some cases - that any analysis and extension is very difficult. And, once an outside observer figures it out, that observer should harbor no reasonable expectation that the structure will not change with the next release.

Perhaps an enterprise-level application that creates sites for an arbitrary number of organizations, with an arbitrary structure and arbitrary content, needs this level of complexity. (Again - no shade on SharePoint here, but no one can claim it _does not_ use a complex schema!) This author suspects that the average reader here does not.

## Document Data Structures

> Complexity is a subsidy.<br>_<small>&ndash; Jonah Goldberg</small>_

The above quote has, admittedly, been yanked from its original context, but it applies here more than we may initially think. The original context refers to government regulations which impose certain burdens on businesses; any legal business must comply with them. As the compliance cost rises, businesses which cannot absorb the overhead of that compliance become non-viable. What may be "budget dust" for a large business may be a cost-prohibitive capital expenditure for a small one. Thus, the regulations end up being a protectionist subsidy for existing businesses.

What does this have to do with databases? Each developer who works on a project has to perform the programmer equivalent of "breaking into the market." (Sometimes, even the original developer has to get back up to speed on what they previously wrote.) Any complexity we can eliminate will make our applications more approachable and maintainable. Every step in a process represents something that can go wrong; avoiding those will make our applications more robust.

> [!NOTE]
> When the relational model was developed, mass storage space was at a premium. As it turned out, structuring data into tables with relationships and non-repeated data is also the most efficient way to store it. Storing documents requires more space, as the field names are stored for each document. Since these are text documents, they compress well; it may not even be something you would notice, but it is worth evaluating.

### Thank You, {vendor_name}

The heading above is rendered correctly. Nearly every relational data store has incorporated a JSON data type; [Oracle][], [SQL Server][], [MySQL][] and [MariaDB][] _(sadly, diverging implementations implemented mostly after the project fork)_, [PostgreSQL][], and [SQLite][] have all recognized the advantages of documents, and incorporated them in their database engines to varying degrees.

> [!TIP]
> As of this writing, PostgreSQL is the winner for document integration. It has two different options for JSON columns (`JSON`, which stores the original text given; and `JSONB`, which stores a parsed binary representation of the text). Additionally, its indexing options can provide efficient document access for any field in the document. It also provides querying options by "containment" (a given document is contained in the field) and by JSON Path (a given document matches an expression). SQLite's implementation was (admittedly) inspired by PostgreSQL's operators.

Thanks to these vendors' efforts, there is a very high likelihood that whatever relational data storage solution you may be currently using may support this hybrid structure today - no upgrades or patching needed!

## Mixing Relational Tables and Documents 

So... how would we tie this all together? The solution sounds simple (and, in some cases, may be) - create tables with document columns for data where the document paradigm fits nicely, while leaving the needs-to-be-relational data in tables. As with many things in the realm of software development, though, a simple idea can lead to a complex implementation.

### Theory

Let's tackle the "simple" part (for no other reason that simplifying the data structure is _the entire point_ of this series). If we have an accounting system with balances, ledgers, debits and credits, etc., we likely have a scenario where we will need a relational, constraint-enforced data store. Customer support calls for this account, though, can vary; they can be general, they can apply to an account overall, or they can apply to a specific transaction. While a relational data store _could_ implement this, a document may be a better choice, particularly for capturing the various calls and actions which may occur over time.

In this scenario, we could have an `account` table along with a `account_transaction` table. Each transaction is tied to an account, as well as the transaction preceding it; we store the "balance forward", the amount of the transaction, and the new balance, as well as a mandatory link to the previous transaction. This prevents miscoded applications (or nefarious database accessors) from removing a large debit transaction, making the account have a higher balance than it should.

We could also have a `support_ticket` table which records communications from the customer from the initial contact through to resolution. We could easily use a document for this, with an array of notes for each communication back-and-forth between the client and the customer. This document could also have an optional link to the account or transaction to which the incident referred, as well as a link to the customer in question.

We - of course - do not want to lose any of that data; however, most of these relationships are optional. What happens if the account is closed - or, beyond that, if a customer is deleted? A purely relational architecture could specifically address this; however, support tickets as documents gives us a different form of traceability by default. We will still have record of the interaction, because the `support_ticket`'s presence did not prevent further business action on the account or customer. At the same time, support tickets did not prevent us from closing an account; the business remained able to take action where needed.

### Practice

As one reviews the APIs in the `Custom` type for any of the projects here, one notices that the document-returning functions take - as their last parameter - a mapping function, which translates between the database row and the expected return item. Each library has predefined functions to return domain items; return a JSON string with matching results; or write the JSON results directly to the output.

As these are designed to expect functions, though, this allows these libraries to be used to not only return deserialized (or raw) JSON documents, but for any required domain item. The function passed to these `Custom` calls can select one field and deserialize it; or, it can pluck various fields from a row result and construct a domain item; or, it can transform these results into some other form.

This enables a true relational-and/or-document (AKA hybrid) data store. Tables against which `Find`, `Document.insert`, etc. are executed are assumed to be document tables, while `Custom` functions/methods allow relational data as well. While not an object-relational model (ORM), writing a to/from mapping for a domain object allows either model to be used in the same data-access paradigm.

> [!NOTE]
> These libraries provide a nice API for these actions - and, of course, are the reason this page exists! However, even if one were to never use any of these libraries, these principles still stand.
 
## Is This Right for Me?

If one has read from the beginning up this point, but is still looking for permission to take the leap - a dry-erase board is your friend. Diagram the tables / documents, brainstorm their interactions, consider the real-world constraints vs. the ones each paradigm lets you model/enforce via the database, and decide from there. (For this author, this is a much simpler data structure which fits all of his side projects perfectly, and one he wishes he could embrace at his more enterprise-y day job.)

Even if the answer is "no," please skim the top part of the next article; some design considerations transcend the document/table decision. In the final article in this series, we will consider the best way to design data structures.


[wp-schema]: https://codex.wordpress.org/Database_Description "Database Description &bull; WordPress Codex"
[Oracle]: https://docs.oracle.com/en/database/oracle/oracle-database/21/adjsn/json-in-oracle-database.html "JSON in Oracle Database &bull; Oracle"
[SQL Server]: https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-ver16 "JSON Data in SQL Server &bull; Microsoft Learn"
[MySQL]: https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-ver16 "The JSON Data Type &bull; MySQL"
[MariaDB]: https://mariadb.com/kb/en/json/ "JSON Data Type &bull; MariaDB"
[PostgreSQL]: https://www.postgresql.org/docs/current/functions-json.html "JSON Functions &bull; PostgreSQL"
[SQLite]: https://sqlite.org/json1.html "JSON Functions and Operators &bull; SQLite"