parent 3034c66f28
commit 4573eb6771

**.gitignore** (vendored, new file, +1)

```
_site
```
**.idea/.gitignore** (generated, vendored, new file, +13)

```
# Default ignored files
/shelf/
/workspace.xml
# Rider ignored files
/.idea.relational-documents.iml
/modules.xml
/projectSettingsUpdater.xml
/contentModel.xml
# Editor-based HTTP Client requests
/httpRequests/
# Datasource local storage ignored files
/dataSources/
/dataSources.local.xml
```
**.idea/indexLayout.xml** (generated, new file, +8)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="UserContentModel">
    <attachedFolders />
    <explicitIncludes />
    <explicitExcludes />
  </component>
</project>
```
**.idea/vcs.xml** (generated, new file, +6)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="VcsDirectoryMappings">
    <mapping directory="" vcs="Git" />
  </component>
</project>
```
@@ -1,3 +1,5 @@

```diff
-# relational-documents
+# Relational Documents
 
-Parent documentation for projects that facilitate storing documents in PostgreSQL and SQLite
+This repository contains the source files for the landing page which covers all Bit Badger Solutions-produced document libraries.
+
+Library-level documentation and API docs will be deployed in subdirectories of this site, although the hand-written documentation will be in each library's repository, and the API docs (JavaDoc, PHPDoc, etc.) will be extracted and generated by the appropriate build tools.
```
**bitbadger-doc.png** (new binary file, 12 KiB; binary file not shown)
**concepts/a-brief-history-of-relational-data.md** (new file, +97)

# A Brief History of Relational Data

## Relational Data

Relational databases were not the first data structure, but when people talk about document databases, their behavior is usually contrasted with that of relational databases. We do not need PhD-level knowledge of these databases, but some high-level concepts will serve us well.

The "relation" in "relational database" names the concept that instances of data can be linked (related) to other pieces of data. Think of a system that keeps track of books for a library; some items they would need to track are the book's title, its author, how many copies they have, and who has a copy checked out. If they were to make a new "book" entry for each physical copy, they would end up repeating nearly everything. With a relational database, though, we could structure tables where very little would be duplicated.
- An `author` table could hold the author's name, plus biographical information, etc.
- A `patron` table could hold the library cardholder's information
- The `book` table could have the name of the book and how many copies the library owns
- A `book_author` table has the ID of a book and the ID of an author
- A `book_checked_out` table could have the ID of a book, the ID of a patron, and the return date

In this example, we have 5 tables to hold the information, and two of those are there solely for the purpose of associating entities with each other. When we think about this structure, there are some interesting ways of parsing the data that weren't covered by the description above.

- Books can have multiple authors; this structure also provides an easy way to find books that an author has written.
- We can count the occurrences of a book in the `book_checked_out` table and subtract that from the copies we own to determine how many copies are available for check out.
- We can easily track what book a single patron has checked out.
- If an author's name changes, when we update the `author` table, the system picks up the new name.

Notice the word "could" in the descriptions of the tables; there are different ways to define relations among entities, and database purists could also come up with scenarios that this structure does not cover. The intent here is to present an isolated yet non-trivial working example that we can use as we think through how this data is structured.
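The five tables above can be sketched directly in SQL. Here is a minimal, hypothetical version using Python's bundled SQLite driver - the table and column names are this sketch's own invention, not anything prescribed above - including the "copies available" calculation from the second list:

```python
import sqlite3

# In-memory database; all names here are illustrative
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE patron (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                       copies_on_hand INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE book_author (
        book_id   INTEGER NOT NULL REFERENCES book   (id),
        author_id INTEGER NOT NULL REFERENCES author (id),
        PRIMARY KEY (book_id, author_id));
    CREATE TABLE book_checked_out (
        book_id     INTEGER NOT NULL REFERENCES book   (id),
        patron_id   INTEGER NOT NULL REFERENCES patron (id),
        return_date TEXT    NOT NULL);
""")

# One book, two copies, one copy checked out...
conn.execute("INSERT INTO author VALUES (1, 'Jane Doe')")
conn.execute("INSERT INTO patron VALUES (1, 'John Smith')")
conn.execute("INSERT INTO book VALUES (1, 'Relational Data 101', 2)")
conn.execute("INSERT INTO book_author VALUES (1, 1)")
conn.execute("INSERT INTO book_checked_out VALUES (1, 1, '2025-01-31')")

# ...leaves one copy available: copies on hand minus current check-outs
available = conn.execute("""
    SELECT b.copies_on_hand - COUNT(c.book_id)
      FROM book b LEFT JOIN book_checked_out c ON c.book_id = b.id
     WHERE b.id = 1
""").fetchone()[0]
print(available)  # → 1
```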
## The ORM Bridge

In high-level programming languages, developers create structures with data representing some entity. Most database drivers deal in tables, rows, and columns. If we were to use one of those libraries, we would end up writing several queries and lots of property setting to constitute a domain object from the relational data in our tables.

An Object-Relational Mapper (ORM) library helps to translate between these structures and the relational database. There are lots of these tools, and they have some pretty solid upsides in most cases. Take, for example, C# objects and Microsoft's Entity Framework Core (EF Core).
```csharp
// This is the definition of a book
public class Book
{
    public long Id { get; set; } = 0L;
    public string Title { get; set; } = "";
    public int CopiesOnHand { get; set; } = 0;
}

// This is how we retrieve a book
// - ctx is an open database context (connection)
// - theId is the variable with the ID of the book we want to retrieve
var book = await ctx.Books.FirstOrDefaultAsync(book => book.Id == theId);
```
This illustrates the simplicity of code using an ORM library. Rather than create a query, execute it, check to see if anything was returned, then assign each column from the result to a property in the class - it's the single line above. (If no `Book` for that ID exists, the `book` variable will be `null`.)

That's great for the book information, but that's not all we need to access; we need authors for the book and we need patrons to check them out. We're not trying to become EF Core experts, but adding this information looks something like this...
```csharp
public class Author
{
    public long Id { get; set; } = 0L;
    public string Name { get; set; } = ""; // naive
    // dates of birth/death, bio, etc.

    // A "navigation property" to find books by author
    public ICollection<Book> Books { get; init; } = new List<Book>();
}

public class Patron
{
    public long Id { get; set; } = 0L;
    public string Name { get; set; } = "";
    public string Phone { get; set; } = "";
    // skipping the navigation property here
}

public class CheckOut
{
    public long BookId { get; set; } = 0L;
    public long PatronId { get; set; } = 0L;
    public DateTime ReturnDate { get; set; } = DateTime.Now;

    // "Navigation properties"
    public Book Book { get; set; } = default!;
    public Patron Patron { get; set; } = default!;
}

// A new Book class
public class Book
{
    // properties as before, then...
    // ...more navigation properties
    public ICollection<Author> Authors { get; init; } = new List<Author>();
    public ICollection<CheckOut> CheckOuts { get; init; } = new List<CheckOut>();
}
```
Notice that the `Author` has a collection of `Book`s, and a `Book` has a collection of `Author`s. This is how the `book_author` table is represented. For checked-out books, we actually have a type that links a `Book` to a `Patron` and also stores the return date. EF Core's "navigation properties" are how it exposes these other entities within .NET code. If we do that same simple `Book` retrieval command from the first example, we can now traverse these properties to display the author's name, who has books checked out, and when they're due back.
## Referential Integrity

The term "referential integrity" describes how these relations are kept in sync. In our example above, we wouldn't want a `book_author` record pointing to either a book or an author that does not exist. We would not want to allow a patron to be deleted while they still had books checked out. (And we may not want to allow them to be deleted at all! If we need to keep records of who checked out what book, when, and when it was returned, we would lose those records if we deleted the patron.)

This is an area where relational databases excel. The IDs in one table that point to rows in another table are called foreign keys. Relational databases allow us to define foreign keys that must exist; ones that may exist or be missing; and what should happen when the parent record is deleted. To accomplish this, indexes are applied to these key fields; this not only lets the integrity checks happen quickly, it also lets queries join the related information efficiently.
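A quick sketch of that enforcement, again using Python's bundled SQLite driver (SQLite requires opting in to foreign key checks; the table names here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE patron (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE book_checked_out (
        book_id     INTEGER NOT NULL,
        patron_id   INTEGER NOT NULL REFERENCES patron (id),
        return_date TEXT    NOT NULL);
    INSERT INTO patron VALUES (1, 'John Smith');
    INSERT INTO book_checked_out VALUES (42, 1, '2025-01-31');
""")

rejected = []
# A check-out pointing at a nonexistent patron is refused...
try:
    conn.execute("INSERT INTO book_checked_out VALUES (43, 99, '2025-02-01')")
except sqlite3.IntegrityError:
    rejected.append("insert")
# ...as is deleting a patron who still has a book checked out
try:
    conn.execute("DELETE FROM patron WHERE id = 1")
except sqlite3.IntegrityError:
    rejected.append("delete")
print(rejected)  # → ['insert', 'delete']
```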
## So... What's the Problem?

There are no problems, only trade-offs. Let's look at how we get to documents from here.
**concepts/application-trade-offs.md** (new file, +71)
# Application Trade-Offs

## Working with Domain Objects

### ORMs

When we first started, we mentioned Object-Relational Mapper (ORM) tools. They keep developers from having to write a lot of boilerplate code for every object and data item the application uses. Many of them provide the ability to track changes to the objects they return, allowing the application to retrieve an object, update some properties, then tell the database to save changes; the tool determines the SQL statements needed to persist those changes. It can handle changes, additions, and deletions for many objects in one request.
ORMs can also be safer. [SQL injection attacks][inject], as part of injection attacks in general, were ranked by the [<abbr title="Open Web Application Security Project">OWASP</abbr> Top 10][owasp] at #1 in 2017 and #3 in 2021. An ORM tool uses parameters to safely pass field content in queries. ORMs make the right thing to do the easy thing: parameterization is the default behavior, and it has to be bypassed (in part or in whole) to write a vulnerable query.

The downside is that certain data structures do not translate well to relational databases. Consider a blog post with tags. In a relational database, the proper way to store these would be in a `post_tag` table, where the post's ID is repeated for each tag. Retrieving a post and its tags requires multiple queries (or a nested subquery which could translate tags to a comma-delimited string). The ORM must assemble this when the object is retrieved, and must decompose the domain object into its rows when updating data. _(Some relational databases, most notably PostgreSQL, do have an `ARRAY` column type, which would simplify this scenario.)_
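That default is easy to see even without an ORM, since every database driver exposes the same parameter mechanism. A sketch with Python's bundled SQLite driver (the table and values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patron (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# A hostile-looking value is harmless when passed as a parameter; the driver
# never splices it into the SQL text, so it is only ever treated as data
scary = "Robert'); DROP TABLE patron;--"
conn.execute("INSERT INTO patron VALUES (?, ?)", (1, scary))
row = conn.execute("SELECT id FROM patron WHERE name = ?", (scary,)).fetchone()
print(row)  # → (1,)
```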
### Serialization

In general, "serialization" refers to the process by which a domain object is converted to a program-independent format; objects can be recreated by "deserializing" the output of a former serialization. (This used to be mostly text formats - plain text, <abbr title="Extensible Markup Language">XML</abbr>, <abbr title="JavaScript Object Notation">JSON</abbr> - but can also be used with common binary formats.) The rise of the Single Page Application (SPA), where data is exchanged using JSON, means that the majority of serialization is happening to and from JSON. As document databases store their documents in JSON form (or a binary version of it), we can build on this to get domain objects into and out of a document database.
If an application already has JSON serialization strategies defined for use in a SPA, these same strategies can (generally) be used with a document database. In some scenarios, this may completely eliminate a deserialization/serialization sequence (from the database to the application, then from the application to the browser): select the JSON from the database and, for multiple results, combine the rows with `,` wrapped with `[` and `]`. Now the server is passing text straight through, versus getting text, making an object, then turning that object back into text.

One big downside, when contrasted with ORMs, is that document database drivers are likely not going to have the "update the object and save changes" paradigm of those tools. Document databases do not support the row locking and other features ORMs use to ensure that their change-aware objects can be persisted.

Another difference is that an "update," in document database terms, usually refers to replacing an entire document. The only property which cannot be changed is the ID; any other property can be added, updated, or removed by an update. To partially update a document, the term is "patch." A patch specifies the JSON which should be present after the update, and search criteria to identify the documents to be patched. (Patching by ID is fine.)
> [!WARNING]
> This is where some consistency risk increases. There is no currency check on what the value was when it was obtained, and even if some other process has patched it, the patch command will not be rejected. While patches that use commands like "increment" or "append" (provided by some document databases) will succeed as expected, others may not. Imagine a library patron whose last name was entered as `Johnsen`. The patron goes to the check-out desk to have it updated to `Janssen` - but someone in the back was doing a quality check on new patrons, and decided to "fix the typo" by correcting it to `Johnson`. If both these updates happened after users had retrieved `Johnsen`'s record, the last one to press "Save" would overwrite the previous one.
### Overfetching

Overfetching is retrieving more data from the database than is necessary for the process being conducted. It gets its own subheading because it is a problem common to both ORMs and document databases (and, in theory, the larger a document becomes, the more the problem is exacerbated). Depending on how much extra data is being returned, it may not be an issue in practice. In most cases, though, thinking about overfetching can be postponed until the process works - but no later. When your little hobby site catches fire, what used to run great can be brought to its knees retrieving data and populating objects for no reason at all.

Consider the admin area for a blog, where a list of posts is displayed. The list probably needs the title, author, status, and published date, but little else. We do not need categories, tags, comments, or even the content of the post.

Both ORMs and document databases usually provide "projections" (even if they use a different term for it). A projection is most simply understood as a different view of the data. For an ORM, perhaps we create a `post_list` view, and a domain object for those items; we can now query that view and select only the data we need. Some document databases have a form of "without" statement that will exclude items from the document. (The application needs to handle these being missing.) Others allow documents to be patched in the result, leaving the database itself unchanged. (The query, paraphrased, is "give me the `post` document patched with the JSON object `{"tags": [], "text": ""}`." The document returned would have an empty list of tags and no text.)
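The "patched result" idea can be sketched with SQLite's `json_patch` in a `SELECT` (the `post` table is invented; note that in SQLite's merge-patch semantics, `null` removes a key outright rather than emptying it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (data TEXT NOT NULL)")
conn.execute("INSERT INTO post VALUES (?)",
             ('{"id": 1, "title": "Hello", "tags": ["intro"], "text": "A long post..."}',))

# Patch the *result only*; the stored document is unchanged
row = conn.execute(
    """SELECT json_patch(data, '{"tags": [], "text": null}') FROM post""").fetchone()
print(row[0])  # tags emptied, text removed; the table still holds the full document
```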
## Working with Data

Documents do not have to be turned into domain objects. Perhaps you are using a language like PHP, where data structures are commonly returned as associative arrays (similar to a dictionary or map in other languages). Or maybe you just need one or two items from a table or document without retrieving the entire thing.

### Addressing Fields
Selecting a few fields in SQL is trivial; some ORMs make it easy to get to the underlying connection, and nearly all relational database drivers have the concept of a result set, where data items for the current row can be accessed by name, by their index in the `SELECT` statement, or both.

Document databases usually provide the ability to retrieve arbitrary fields from documents, but their implementations vary. [MongoDB][] allows you to specify `1` for fields to include (and `0` to exclude the ID, which is returned otherwise). [RethinkDB][] provides the `.pluck` command to select certain fields and the `.without` command to exclude certain fields.

For documents stored in relational databases, there is syntax for selecting fields from documents similar to how columns from a table are selected, and the results are addressed the same way as columns which come from the table itself. As with document databases, though, the syntax varies, and may be implemented as custom operators or function calls.
### Indexing

We have yet to discuss indexes to any great extent. They can bring huge performance boosts in either data paradigm. While we'll consider them fully when we dig into document design, a short consideration here will serve us well. Both relational and document databases use a unique index on the primary key column(s) or field; here, we'll look at others we may need.

In a relational database, foreign key fields should be indexed. These databases maintain integrity by checking values as described in the constraint, and they do this every time an `INSERT` or `UPDATE` is executed which sets a new value. If the foreign key is not indexed, the database has to search every row of the table (a "full table scan"). These indexes help our application as well; for our patron / checked-out book association, the index will help us identify those rows quickly, whether we are starting from the patron or the book.

Other indexes can be created for fields commonly found in a `WHERE` clause. If we built a "find a patron by their e-mail address" process into our library system, we would likely want to index the `email` field we would add to their record. (Doubly so if they can also use that e-mail address to sign in to the library system to access resources there.)

Relational indexes are not free; they take up some space, and the database's ACID guarantees apply to indexes as well. This can slow down updates, particularly as more indexes need to be updated. A great starting point is to index primary keys (which the database does for you), foreign keys, and commonly searched items.

Document database indexing is one area where vendors can distinguish their products from others. The shape of the data (arrays, sub-documents, etc.) also requires more indexing options. Most allow creation of an index on an array which can be used to mimic a SQL `IN` query. Some allow indexing computed values, which are actually stored in the index and can be retrieved from there. Some also allow all indexed values to be stored in the index; in these cases, queries that only require those fields do not have to retrieve the actual document, as the index can satisfy the query.

Document indexes may not be ACID-compliant, particularly regarding consistency. In some cases, an index is explicitly _not_ updated with the command that updates the data; it is updated in the background once the database system has said it's done. In other cases, the application can specifically request to wait until an index is consistent.
## Interim Summary

We have looked at [relational databases][one], [document databases][two], [trade-offs between their data stores][three], and now trade-offs from an application perspective. We have seen both the strengths and weaknesses of each data model. What if we could get the benefits of relational data _and_ documents at the same time?
[inject]: https://en.wikipedia.org/wiki/SQL_injection "SQL injection • Wikipedia"
[owasp]: https://owasp.org/www-project-top-ten/ "OWASP Top 10"
[MongoDB]: https://www.mongodb.com/ "MongoDB"
[RethinkDB]: https://rethinkdb.com/ "RethinkDB"
[one]: ./a-brief-history-of-relational-data.md "A Brief History of Relational Data • Bit Badger Solutions"
[two]: ./what-are-documents.md "What Are Documents? • Bit Badger Solutions"
[three]: ./relational-document-trade-offs.md "Relational / Document Trade-Offs • Bit Badger Solutions"
**concepts/document-design-considerations.md** (new file, +89)
# Document Design Considerations

When designing any data store, determining how data will be retrieved is often a secondary consideration. Developers get requirements, and we immediately start thinking of how we would store the data that will be produced. Then, when it comes time to search data, produce reports, etc., the process can be painful. We have used the term "consideration" a lot (including in the title of this page!) because there are a lot of ways to store the same information. Understanding how that data will be used (and why, and when) can guide design decisions.

As a quick example, consider a customer record. How many addresses will we store for each one? Should they be labeled? Things like state or province are a finite list of choices; do we enforce an accurate selection at the data level? Do we care about addresses that are no longer current? We could end up with anything from a blob of free-form text up to a set of tables, with pieces of the address spread out among them. How these addresses will be used will likely eliminate some options.

No data storage paradigm eliminates these considerations. It may take a bit more time up front, but schema changes and data migration on an operational system can take even more time (and bring complexity that may have been avoided).
## Recognizing Appropriate Relational Data

This will be a short section, as previous articles should have made the point explicitly that not all data is appropriate for a document model. If relationships between certain entities must never be allowed to fall out of sync, a document structure is not the best structure for those entities.

## Designing Documents

Having eliminated scenarios where documents are not appropriate, let's design our documents to capture data which fits that paradigm.
### Repeated Data

Many many-to-one relationships in a relational database could be represented as an array in the parent document. Returning to our hotel room example, the rental history of each room could be represented as an array on the room itself. This would give us a quick way to find who was in each room at what point; and, provided our database keys lined up, we could also tell a customer which rooms we charged to their account, and for which dates.

The main question for this structure is: what other queries against a room would we require? And, given how we could best answer those questions, is an array of reservations the best way to represent them? This is the key consideration for an array-in-document vs. separate multi-entry table decision. Adding a reservation to an inlined array is relatively trivial. However, which entity owns the reservation array? Are reservations based on the room, while related to the customer? Are they based on the customer, and associated with the room? Are they an entity unto themselves (represented as multiple rows vs. inlined in a document)?

In this case, this author would likely have reservations as their own entity, or have reservations inlined in the customer document. It may make sense to split reservations and completed stays into separate arrays; queries for upcoming reservations would likely occur more frequently than those for completed stays, and this would narrow the data set for the former queries to only those reservations that are actually pending.

_(This is not "the right answer"; it is but one way it could be implemented.)_

One area that is more straightforward would be e-mail addresses for our customers. If we want to allow them to have more than one e-mail address on their record, this is easily represented as an inline array in the customer document. While it does mean that we cannot look up a customer by e-mail address using a straight `=` condition, we _can_ store their primary e-mail address as the first entry in this array, and use `email[0]` in cases where we need it.
### Related Data

One theme underlying all this discussion is that data is related to other data. These relations are where our next decision point lies. Are the relationships optional? If so, can those optional relationships be defined by their presence? If so, the relationship may be a candidate for a document property instead of a child-table relationship (with or without a foreign key).

Let's think through this reservation scenario a bit more. Most hotel reservations are not made for a specific room; they are usually based on room type (number and configuration of beds, extra space, etc.). The hotel knows how many rooms it has of each type, and what reservations it currently has, so it can give accurate availability numbers. However, hotels do not usually assign a room number when the reservation is made. This gives them the flexibility to accommodate changes with current customers - say, someone who stays over for an additional 3 days - without being disruptive to either their current customer _or_ the next customer they had assigned to the room that is now occupied.

We may, though, have a few regular customers who stay frequently and want a particular room. Since these are our "regulars," we do not want to create a system where we cannot assign a room at reservation time. (Inconveniencing your regular customers is not a recipe for success in any business!)

If we make a reservation its own document, we could have the following properties:
- ID
- Customer ID
- Arrival Date
- Duration of Stay (nights)
- Room Type
- Room ID
- Do Not Move (`true` or `false`, if present)
- Special Instructions
Of these, the first five are required; the first identifies the reservation, the second identifies the customer, and the next three are the heart of the reservation. For most reservations, these would be the only fields in the document (or the others would be `null`). Once rooms are assigned, the room ID would be filled in. However, for our regulars, we would fill it in when they made the reservation, and we would set the "do not move" flag to indicate that this room assignment should not be changed. Special instructions could be anything ("first floor", "near stairs", etc.).

> [!NOTE]
> Although Customer ID is a required field, a document database does not enforce this constraint. Managing these sorts of relationships becomes the responsibility of the application. If this were stored as an array in the customer document, we would not need the Customer ID property, and its presence in their document would establish the relationship.

We can apply this same optional-relationship pattern to other documents. Customer service tickets could have an optional Room ID property, which would indicate whether a call pertained to a specific room. These tickets could also have an array of log entries with date, user, and a narrative about what happened. This gives us another example of both optional IDs and relationship via containment.
### Domain Objects

Some readers may be thinking, "Man, I'm never going to be dealing with data at this level; I just want to store my application's data." In this case, the application's structure takes the lead, and the database is there to support it. (Microsoft's Entity Framework "Code First" pioneered this concept for relational data stores.) When we say "domain object," we mean whatever the application uses to structure its data; it could be a class or a dictionary / associative array.

Storing and retrieving domain objects involves JSON serialization and deserialization. The domain object is serialized to JSON to store it, and deserialized from JSON to reconstitute it in the application. JSON has only six data types - array, object, string, number, boolean, and null - yet it can represent arbitrary structures using just these types.

In these cases, the document's structure will match that of the domain object. Instead of the way an object-relational mapper splits out other objects, arrays, etc., all the information for that domain item is in one document. This means that data access paths match those in your application: `customer.address.city` in your application can be addressed by the JSON path `$.address.city` on the customer document. Assuming the document was in a `customer` table, stored in a `data` column, querying the city could be done as follows in both PostgreSQL and SQLite:
```sql
SELECT data FROM customer WHERE data->'address'->>'city' = :city
```
> [!NOTE]
> The document libraries hosted here provide the dot-notation access for use in programs; to find all customers in Chicago, the following C# code will generate something that looks a lot like the query above.
>
> ```csharp
> Find.ByFields<Customer>("customer", Field.Equal("address.city", "Chicago"));
> ```
## Conclusion

> If you have read this entire series and arrived here - **THANK YOU**! People like you are the ones this author had in mind when he made the decision to write it.

The main points to take away are:

- Document databases are an interesting and compelling way of structuring data.
- Common relational databases have implemented JSON document columns and functions/operators to manipulate them.
- Using a hybrid approach allows us to avoid some relational pain points (i.e., complexity).
- Documents are not a magic bullet; they still require design considerations.

Documents may not be _the_ solution for your data storage needs - or they may be! - but they are a valuable tool in your collection. JSON document columns in an otherwise-relational table are another interesting option which we did not explore here. There are many ways to incorporate the good parts of documents to reduce complexity, and you are probably already using a database which supports them.

The libraries linked across the top of the page provide an easy, document-database-style interface for storing documents in PostgreSQL and SQLite. They also provide a custom mapping function interface against database results (`Npgsql.FSharp` for F#, `ADO.NET` for C#, `PDO` for PHP, and `JDBC` for JVM languages). Instead of creating a connection, creating a command, setting up the query, iteratively binding parameters, executing the query, and looping through the results, these take a query, a collection of parameters, and a mapping function - all that other work still happens, but the libraries abstract it away.

Whether these libraries find their way into your tool belt or not, we hope you have gained knowledge. When we reduce complexity - leading to applications which are more robust, reliable, and maintainable - everybody wins!
96
concepts/hybrid-data-stores.md
Normal file
@ -0,0 +1,96 @@
# Hybrid Data Stores

If you have been reading this series in order, you have likely thought about applications you may have worked with in the past and considered how a document structure might look. Perhaps there were many thoughts like "Oh yeah, that makes sense," or "Oh, that would be way easier than...." _(If so, then this series of articles has had its intended effect - so far.)_

Conversely, there may have been other thoughts - things like "We can't do that because..." or "We really need ACID guarantees for \[scenario\]." While some data may work as documents, other data cannot. This limiting factor means that the data store will stay with a relational database, in spite of the extra work it takes to maintain a rigid structure throughout the application. Besides, we're already used to it!

If both paragraphs above capture your thoughts, you are in the target market for a hybrid data store. Rather than jettison one data structure and switch to another, we can incorporate the good parts from both in a single data store.

## Relational Data Structures

### Can Be Flexible...

There is a reason that relational data stores are the default representation for data at rest throughout the industry, both professional and hobbyist. Despite depictions of the relational model as rigid and structured, it is incredibly flexible as to the structure one can define - and nullable columns and foreign keys provide the ability to model relationships that may not exist.

Content management systems (CMSs) are the textbook example of how one can build a flexible application using relational tables. Applications such as WordPress and SharePoint are commonplace - you likely know both without any further research - yet both are built on relational databases (MySQL and SQL Server, respectively). As of this writing, WordPress serves just over 50% of the sites on the Internet, and SharePoint is the _de facto_ standard in enterprises, serving countless sites both on the public Internet and in private networks (and serving as the backend store for Microsoft Teams' file sharing).
### ...But Can Be Complex

At one point in his career, this author joked about making a table called `the_table` with two columns (`id`, a 32-character string, and `the_field`, an unbounded text field), which would have been the ultimate flexible schema for the application in question (which, at the time, used a data model that was neither relational nor document-oriented). This joke was taken as such (and appreciated) because it pointed out that, for all the complaints about how inflexible our current data structure was, it _could_ be replaced with something even worse. Even the human body would be reduced to a puddle of goo were it not for the skeletal structure on which our organs rely.

_As segues go - is there any entity more complex than the human body? Thankfully, database structures are well above the level of molecular biology or physiology!_

The two examples above, though, show different approaches to the flexible-relational paradigm. The [WordPress schema][wp-schema] is not huge - 12 tables as of this writing - yet all the content for some major media sites is contained (mostly; plugins can create tables for their own use) in those tables. In the `wp_posts` table, there are:

- Blog posts (initial post, maybe updates, comments, categorizations, etc.)
- Static pages (timeless content that may change infrequently; similar to this page)
- Any other custom content items

The key to this is the `post_type` column, a free-form 20-character field used to indicate what sort of content this "post" represents. But, because of this flexibility, the schema supports data structures that are not valid in the application. Blog posts can have categories (broad topics to which the post applies) and tags (distinct points mentioned in the particular post), and both are stored in the `wp_term_relationships` table. Static pages, also stored in `wp_posts`, have neither of these applied to them, yet there is nothing in the data model that prevents lots of `wp_term_relationships` rows for these as well (which may be valid, if a plugin adds some other way of categorizing pages).

> [!NOTE]
> The purpose of the above is not to rip on WordPress's schema. What that project has implemented with just a few tables, and the way they ensure that major and minor upgrade versions are supported, is nigh-heroic. This author's biggest quibble with the WordPress schema has to do with its use of plural nouns for tables; it's the "post" table, not the "posts" table, darn it....

If, from a data structure perspective, WordPress is complex - SharePoint blows it out of the water. It uses _so many tables_ - dynamically creating them in some cases - that any analysis and extension is very difficult. And, once an outside observer figures it out, that observer should harbor no reasonable expectation that the structure will not change with the next release.

Perhaps an enterprise-level application that creates sites for an arbitrary number of organizations, with arbitrary structure and arbitrary content, needs this level of complexity. (Again - no shade on SharePoint here, but no one can claim it _does not_ use a complex schema!) This author suspects that the average reader here does not.
## Document Data Structures

> Complexity is a subsidy.<br>_<small>– Jonah Goldberg</small>_

The above quote has, admittedly, been yanked from its original context, but it applies here more than we may initially think. The original context refers to government regulations which impose certain burdens on businesses; any legal business must comply with them. As the compliance cost rises, businesses which cannot absorb the overhead of that compliance become non-viable. What may be "budget dust" for a large business may be a cost-prohibitive capital expenditure for a small one. Thus, the regulations end up being a protectionist subsidy for existing businesses.

What does this have to do with databases? Each developer who works on a project has to perform the programmer equivalent of "breaking into the market." (Sometimes, even the original developer has to get back up to speed on what they previously wrote.) Any complexity we can eliminate will make our applications more approachable and maintainable. Every step in a process represents something that can go wrong; avoiding those steps will make our applications more robust.

> [!NOTE]
> When the relational model was developed, mass storage space was at a premium. As it turned out, structuring data into tables with relationships and non-repeated data is also the most efficient way to store it. Storing documents requires more space, as the field names are stored for each document. Since these are text documents, they compress well; it may not even be something you would notice, but it is worth evaluating.
### Thank You, {vendor_name}

The heading above is rendered correctly. Nearly every relational data store has incorporated a JSON data type; [Oracle][], [SQL Server][], [MySQL][] and [MariaDB][] _(sadly, with diverging implementations, developed mostly after the project fork)_, [PostgreSQL][], and [SQLite][] have all recognized the advantages of documents, and have incorporated them in their database engines to varying degrees.

> [!TIP]
> As of this writing, PostgreSQL is the winner for document integration. It has two different options for JSON columns (`JSON`, which stores the original text given; and `JSONB`, which stores a parsed binary representation of the text). Additionally, its indexing options can provide efficient document access for any field in the document. It also provides querying options by "containment" (a given document is contained in the field) and by JSON Path (a given document matches an expression). SQLite's implementation was (admittedly) inspired by PostgreSQL's operators.
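To make those two query styles concrete, here is a brief sketch against a hypothetical `customer` table with a `JSONB` column named `data` (the table and field names are illustrative, not from any particular library):

```sql
-- Containment: find documents that "contain" the given JSON fragment.
SELECT data FROM customer WHERE data @> '{"address": {"city": "Chicago"}}';

-- JSON Path: find documents matching a path expression.
SELECT data FROM customer WHERE data @? '$.address.city ? (@ == "Chicago")';
```

Both forms can be served by a GIN index on the `data` column, which is part of what makes PostgreSQL's document support efficient as well as expressive.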
Thanks to these vendors' efforts, there is a very high likelihood that whatever relational data storage solution you are currently using already supports this hybrid structure - no upgrades or patching needed!

## Mixing Relational Tables and Documents

So... how would we tie this all together? The solution sounds simple (and, in some cases, may be) - create tables with document columns for data where the document paradigm fits nicely, while leaving the needs-to-be-relational data in tables. As with many things in the realm of software development, though, a simple idea can lead to a complex implementation.

### Theory

Let's tackle the "simple" part first (for no other reason than that simplifying the data structure is _the entire point_ of this series). If we have an accounting system with balances, ledgers, debits and credits, etc., we likely have a scenario where we will need a relational, constraint-enforced data store. Customer support calls for an account, though, can vary; they can be general, they can apply to an account overall, or they can apply to a specific transaction. While a relational data store _could_ implement this, a document may be a better choice, particularly for capturing the various calls and actions which may occur over time.

In this scenario, we could have an `account` table along with an `account_transaction` table. Each transaction is tied to an account as well as to the transaction preceding it; we store the "balance forward," the amount of the transaction, and the new balance, along with a mandatory link to the previous transaction. This prevents miscoded applications (or nefarious database accessors) from removing a large debit transaction, making the account have a higher balance than it should.

We could also have a `support_ticket` table which records communications from the customer from the initial contact through to resolution. We could easily use a document for this, with an array of notes for each communication back and forth between the client and the customer. This document could also have an optional link to the account or transaction to which the incident referred, as well as a link to the customer in question.
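A minimal sketch of this hybrid schema might look like the following; the table and column names are illustrative, not taken from any of the libraries here:

```sql
-- Relational side: constraint-enforced financial data
CREATE TABLE account (
    id   TEXT PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE account_transaction (
    id              TEXT PRIMARY KEY,
    account_id      TEXT NOT NULL REFERENCES account (id),
    -- nullable only for an account's opening transaction
    previous_id     TEXT REFERENCES account_transaction (id),
    balance_forward NUMERIC NOT NULL,
    amount          NUMERIC NOT NULL,
    new_balance     NUMERIC NOT NULL
);

-- Document side: flexible support history stored as a JSONB document,
-- with optional account/transaction/customer links inside the document
CREATE TABLE support_ticket (
    data JSONB NOT NULL
);
```

The relational tables carry the constraints we must have, while the document table stays free to evolve as support workflows change.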
We - of course - do not want to lose any of that data; however, most of these relationships are optional. What happens if the account is closed - or, beyond that, if a customer is deleted? A purely relational architecture would have to address this specifically; support tickets as documents, however, give us a different form of traceability by default. We still have a record of the interaction, because the `support_ticket`'s presence did not prevent further business action on the account or customer. At the same time, support tickets did not prevent us from closing an account; the business remained able to take action where needed.

### Practice

As one reviews the APIs in the `Custom` type for any of the projects here, one notices that the document-returning functions take - as their last parameter - a mapping function, which translates between the database row and the expected return item. Each library has predefined functions to return domain items, return a JSON string with matching results, or write the JSON results directly to the output.

As these are designed to expect functions, though, these libraries can be used to return not only deserialized (or raw) JSON documents, but any required domain item. The function passed to these `Custom` calls can select one field and deserialize it; or it can pluck various fields from a row result and construct a domain item; or it can transform these results into some other form.

This enables a true relational-and/or-document (AKA hybrid) data store. Tables against which `Find`, `Document.insert`, etc. are executed are assumed to be document tables, while `Custom` functions/methods allow relational access as well. While this is not an object-relational mapper (ORM), writing a to/from mapping for a domain object allows either model to be used in the same data-access paradigm.
> [!NOTE]
> These libraries provide a nice API for these actions - and, of course, are the reason this page exists! However, even if one were never to use any of these libraries, these principles still stand.

## Is This Right for Me?

If one has read from the beginning up to this point, but is still looking for permission to take the leap - a dry-erase board is your friend. Diagram the tables / documents, brainstorm their interactions, consider the real-world constraints vs. the ones each paradigm lets you model and enforce via the database, and decide from there. (For this author, this is a much simpler data structure which fits all of his side projects perfectly, and one he wishes he could embrace at his more enterprise-y day job.)

Even if the answer is "no," please skim the top part of the next article; some design considerations transcend the document/table decision. In the final article in this series, we will consider the best way to design data structures.
[wp-schema]: https://codex.wordpress.org/Database_Description "Database Description • WordPress Codex"
[Oracle]: https://docs.oracle.com/en/database/oracle/oracle-database/21/adjsn/json-in-oracle-database.html "JSON in Oracle Database • Oracle"
[SQL Server]: https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-ver16 "JSON Data in SQL Server • Microsoft Learn"
[MySQL]: https://dev.mysql.com/doc/refman/8.0/en/json.html "The JSON Data Type • MySQL"
[MariaDB]: https://mariadb.com/kb/en/json/ "JSON Data Type • MariaDB"
[PostgreSQL]: https://www.postgresql.org/docs/current/functions-json.html "JSON Functions • PostgreSQL"
[SQLite]: https://sqlite.org/json1.html "JSON Functions and Operators • SQLite"
224
concepts/referential-integrity.md
Normal file
@ -0,0 +1,224 @@
# Referential Integrity with Documents

> [!NOTE]
> This page is a technical exploration of ways to enforce referential integrity within or among documents in PostgreSQL. It concludes with a consideration of whether this is a good idea or not. Also, while SQLite may support a similar technique, we will not be considering it here.

One of the hallmarks of document databases is loose association between documents. In the hotel / room example, with each being its own document collection, there is no technical reason we could not delete every hotel in the database, leaving all the rooms with hotel IDs that no longer exist. This is a feature-not-a-bug, but it shows the trade-offs inherent in selecting a data storage mechanism. In our case, this is less than ideal - but, since we are using PostgreSQL, a relational database, we can implement referential integrity if, when, and where we need it.

## Enforcing Referential Integrity on the Child Document

We can reference specific fields in a document the same way we would address a column; e.g., `data->>'Id'` will give us the ID from a JSON (or JSONB) column. However, we cannot define a foreign key constraint against an arbitrary expression. Through database triggers, though, we can accomplish the same thing.

Triggers are implemented in PostgreSQL through a function/trigger definition pair. A function defined as a trigger has `NEW` and `OLD` defined as the data being manipulated (one or both, depending on the operation; there is no `OLD` for `INSERT`s, no `NEW` for `DELETE`s, etc.). For our purposes here, we'll use `NEW`, as we're trying to verify the data as it's being inserted or updated.
```sql
CREATE OR REPLACE FUNCTION room_hotel_id_fk() RETURNS TRIGGER AS $$
    DECLARE
        hotel_id TEXT;
    BEGIN
        SELECT data->>'Id' INTO hotel_id FROM hotel WHERE data->>'Id' = NEW.data->>'HotelId';
        IF hotel_id IS NULL THEN
            RAISE EXCEPTION 'Hotel ID % does not exist', NEW.data->>'HotelId';
        END IF;
        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER hotel_enforce_fk BEFORE INSERT OR UPDATE ON room
    FOR EACH ROW EXECUTE FUNCTION room_hotel_id_fk();
```

This is as straightforward as we can make it; if the query fails to retrieve data (returning `NULL` here, not raising `NO_DATA_FOUND` as Oracle would), we raise an exception. Here's what that looks like in practice.

```
hotel=# insert into room values ('{"Id": "one", "HotelId": "fifteen"}');
ERROR: Hotel ID fifteen does not exist
CONTEXT: PL/pgSQL function room_hotel_id_fk() line 7 at RAISE
hotel=# insert into hotel values ('{"Id": "fifteen", "Name": "Demo Hotel"}');
INSERT 0 1
hotel=# insert into room values ('{"Id": "one", "HotelId": "fifteen"}');
INSERT 0 1
```

(This assumes we'll always have a `HotelId` field; [see below][] for how to create this trigger if the foreign key is optional.)
## Enforcing Referential Integrity on the Parent Document

We've only addressed half of the parent/child relationship so far; now, we need to make sure parents don't disappear.

### Referencing the Child Key

The trigger on `room` referenced `hotel`'s unique index in its lookup. When we try to go from `hotel` to `room`, though, we'll need to address the `HotelId` field of the `room` document. For the best efficiency, we can index that field. (This is also a best practice for relational foreign keys.)

```sql
CREATE INDEX IF NOT EXISTS idx_room_hotel_id ON room ((data->>'HotelId'));
```
### `ON DELETE NO ACTION`

When defining a foreign key constraint, the final part of that clause is an `ON DELETE` action; if it is excluded, it defaults to `NO ACTION`. The effect of this is that rows cannot be deleted if they are referenced in a child table. This can be implemented by looking for any rows that reference the hotel being deleted, and raising an exception if any are found.

```sql
CREATE OR REPLACE FUNCTION hotel_room_delete_prevent() RETURNS TRIGGER AS $$
    DECLARE
        has_rows BOOL;
    BEGIN
        SELECT EXISTS(SELECT 1 FROM room WHERE OLD.data->>'Id' = data->>'HotelId') INTO has_rows;
        IF has_rows THEN
            RAISE EXCEPTION 'Hotel ID % has dependent rooms; cannot delete', OLD.data->>'Id';
        END IF;
        RETURN OLD;
    END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER hotel_room_delete BEFORE DELETE ON hotel
    FOR EACH ROW EXECUTE FUNCTION hotel_room_delete_prevent();
```

This trigger in action...
```
hotel=# delete from hotel where data->>'Id' = 'fifteen';
ERROR: Hotel ID fifteen has dependent rooms; cannot delete
CONTEXT: PL/pgSQL function hotel_room_delete_prevent() line 7 at RAISE
hotel=# select * from room;
                data
-------------------------------------
 {"Id": "one", "HotelId": "fifteen"}
(1 row)
```

There's that child record! We've successfully prevented an orphaned room.
### `ON DELETE CASCADE`

Rather than prevent deletion, another foreign key constraint option is to delete the dependent records as well; the delete "cascades" (like a waterfall) to the child tables. Implementing this is even less code!

```sql
CREATE OR REPLACE FUNCTION hotel_room_delete_cascade() RETURNS TRIGGER AS $$
    BEGIN
        DELETE FROM room WHERE data->>'HotelId' = OLD.data->>'Id';
        RETURN OLD;
    END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER hotel_room_delete BEFORE DELETE ON hotel
    FOR EACH ROW EXECUTE FUNCTION hotel_room_delete_cascade();
```

Here is what happens when we try the same `DELETE` statement that was prevented above...
```
hotel=# select * from room;
                data
-------------------------------------
 {"Id": "one", "HotelId": "fifteen"}
(1 row)

hotel=# delete from hotel where data->>'Id' = 'fifteen';
DELETE 1
hotel=# select * from room;
 data
------
(0 rows)
```

We deleted a hotel, not rooms, but the rooms are now gone as well.
### `ON DELETE SET NULL`

The final option for a foreign key constraint is to set the column in the dependent table to `NULL`. There are two options to set a field to `NULL` in a `JSONB` document: we can either explicitly give the field a value of `null`, or we can remove the field from the document. As there is no schema, the latter is cleaner; PostgreSQL will return `NULL` for any non-existent field.

```sql
CREATE OR REPLACE FUNCTION hotel_room_delete_set_null() RETURNS TRIGGER AS $$
    BEGIN
        UPDATE room SET data = data - 'HotelId' WHERE data->>'HotelId' = OLD.data->>'Id';
        RETURN OLD;
    END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER hotel_room_delete BEFORE DELETE ON hotel
    FOR EACH ROW EXECUTE FUNCTION hotel_room_delete_set_null();
```

That `-` operator is new for us. When used on a `JSONB` field, it removes the named field from the document.

Let's watch it work...
```
hotel=# delete from hotel where data->>'Id' = 'fifteen';
ERROR: Hotel ID <NULL> does not exist
CONTEXT: PL/pgSQL function room_hotel_id_fk() line 7 at RAISE
SQL statement "UPDATE room SET data = data - 'HotelId' WHERE data->>'HotelId' = OLD.data->>'Id'"
PL/pgSQL function hotel_room_delete_set_null() line 3 at SQL statement
```

Oops! This trigger execution fired the `BEFORE UPDATE` trigger on `room`, and it took exception to us setting that value to `NULL`. The child table's trigger assumes we'll always have a value; we'll need to tweak that trigger to allow this.
```sql
CREATE OR REPLACE FUNCTION room_hotel_id_nullable_fk() RETURNS TRIGGER AS $$
    DECLARE
        hotel_id TEXT;
    BEGIN
        IF NEW.data->>'HotelId' IS NOT NULL THEN
            SELECT data->>'Id' INTO hotel_id FROM hotel WHERE data->>'Id' = NEW.data->>'HotelId';
            IF hotel_id IS NULL THEN
                RAISE EXCEPTION 'Hotel ID % does not exist', NEW.data->>'HotelId';
            END IF;
        END IF;
        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER hotel_enforce_fk BEFORE INSERT OR UPDATE ON room
    FOR EACH ROW EXECUTE FUNCTION room_hotel_id_nullable_fk();
```
Now, when we try to run the deletion, it works.

```
hotel=# select * from room;
                data
-------------------------------------
 {"Id": "one", "HotelId": "fifteen"}
(1 row)

hotel=# delete from hotel where data->>'Id' = 'fifteen';
DELETE 1
hotel=# select * from room;
     data
---------------
 {"Id": "one"}
(1 row)
```
## Should We Do This?

You may be thinking, "Hey, this is pretty cool; why not do this everywhere?" Well, the answer is - as it is with _everything_ software-development-related - "it depends."

### No...?

The flexible, schemaless data storage paradigm that we call "document databases" allows changes to happen quickly. While "schemaless" can mean "ad hoc," in practice most documents have a well-defined structure. Not having to define columns for each item, then re-define or migrate them when things change, brings a lot of benefits.

What we've implemented above complicates some processes. Sure, triggers can be disabled and then re-enabled, but unlike true constraints, they do not validate existing data. If we were to disable triggers, run some updates, and re-enable them, we could end up with records that cannot be saved in their current state.

### Yes...?

The lack of referential integrity in document databases can be an impediment to adoption in areas where that paradigm may be more suitable than a relational one. To be sure, there are fewer relationships in a document database whose documents have complex structures, arrays, etc. This doesn't mean that there won't be relationships, though; in our hotel example, we could easily see a "reservation" document that holds the IDs of a customer and a room. Just as it didn't make much sense to embed the rooms in a hotel document, it doesn't make sense to embed customers in a room document.

What PostgreSQL brings to all of this is that it does not have to be an all-or-nothing decision regarding referential integrity. We can implement a document store with no constraints, then apply the ones we absolutely must have. We realize we're complicating maintenance a bit (though `pg_dump` will create a backup with the proper order for restoration), but we like that PostgreSQL will protect us from broken code or mistyped `UPDATE` statements.
## Going Further

As these trigger functions execute SQL, it would be possible to create a set of reusable trigger functions that take the table and column as parameters. Dynamic SQL in PL/pgSQL is additional complexity that would have distracted from the concepts here, but feel free to take the examples above and make them reusable.

Finally, one piece we have not covered is `CHECK` constraints. These can be applied to tables using the `data->>'Key'` syntax, and can be used to apply more of a schema feel to the otherwise-unstructured `JSONB` document. PostgreSQL's handling of JSON data really is first-class and unopinionated; you can use as much or as little as you like!
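As a brief sketch of that idea - continuing the hotel example, and assuming the same single-`JSONB`-column layout used above - a `CHECK` constraint can require that every room document carries an `Id`:

```sql
-- Require that every room document has a non-null Id field;
-- inserts and updates violating this will be rejected.
ALTER TABLE room
    ADD CONSTRAINT room_id_required CHECK (data->>'Id' IS NOT NULL);
```

Unlike the triggers above, a `CHECK` constraint also validates existing rows when it is added, which gives it more of a true schema feel.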
[« Back to Advanced Usage for `BitBadger.Documents`][adv]

[« Back to Advanced Usage for `PDODocument`][adv-pdo]

[see below]: #on-delete-set-null
[adv]: https://bitbadger.solutions/open-source/relational-documents/dotnet/advanced-usage.html "Advanced Usage • BitBadger.Documents • Bit Badger Solutions"
[adv-pdo]: https://bitbadger.solutions/open-source/relational-documents/php/advanced-usage.html "Advanced Usage • PDODocument • Bit Badger Solutions"
75
concepts/relational-document-trade-offs.md
Normal file
@ -0,0 +1,75 @@
# Relational / Document Trade-Offs

> There are no solutions. There are only trade-offs.<br>_<small>— Thomas Sowell</small>_

While the context of this quote is economics, it is a concept with many applications, including this topic. There are generally accepted principles of data storage, proved in enterprise applications representing billions in commerce annually. The site you're reading is written by one person, whose business has occasionally crossed the threshold to profitable (but it's been a while). Do we think this site has the same data storage needs as a Fortune 50 enterprise?

Some would say yes. To get to this page, you have likely clicked links that needed to point to pages that actually exist. The software running the site needs to know who I am, and record me as the author of this (and every other) page. It has content, and that content is related. Every time I save a page edit, the software records a revision; each revision needs to be tied to the right page, and if a page goes away, all its revisions should as well.

Most people, though, would probably say no. _Of course_ I do not need large, distributed data centers with dozens of employees supporting my data storage needs. Even if I structure my database poorly, leftover revisions from a deleted page are likely not going to cause a blip in performance, much less fill up a disk. If I do something to mess up a database, in the worst case, I can drop back to the previous night's backup.

"OK, when did this become about the author?", you may be thinking. It isn't _(though if you would like to help make this profitable, reach out!)_; it's an illustration that, while the principles are good - and I'm about to defend them - they are not the only way. By understanding the principles, and the trade-offs, you may be able to reduce complexity in your application.
## The ACID Test

Relational databases, as a general rule, are [ACID][]-compliant. These principles (summarized) mean that:

* Transactions are treated as a single unit, whether they are a single statement or multiple statements; it all works, or it all fails (an "atomic" transaction: atomicity)
* A transaction cannot leave the data store in an inconsistent state; all constraints must be satisfied (consistency)
* Concurrent transactions cannot see other in-progress transactions (isolation)
* Transactions reported as successful will still be there, even if the server goes down, is interrupted, etc. (durability)

These principles were part of the data structure we designed in the first page. The links between `author` and `book`, and `patron` and `book`, fall under consistency; if we tried to check a book out to patron 1234, and that patron did not exist, the transaction would fail. If two librarians are checking out two different books to two different patrons at the same instant, there should be no problem (isolation). However, if they are trying to check out the last copy of the same book - well, at that point, we must decide how to handle it; absent a handling strategy, the second attempt will fail (isolation, consistency).
## Distributed Data

Even with advances in CPU and storage, there are limits to what one database server can do. "Edge computing" - pushing content as close to its consumer as possible - is easy to do with static files, but can be more challenging for data, especially if ACID guarantees are required. There are several strategies, and their complexities are well beyond our scope; we'll summarize a few here, because it will help with our consideration.

* Sharding - Data is physically placed within the database based on the value of a field. "Region," "year," and "first letter of last name" are all valid sharding strategies.
* Replicas - The database, in its entirety, is replicated to other locations. Read-only processes can look at these replicas, rather than the main database, if no updates are required. This reduces the load on the main database, and a replica can be promoted to main if the main becomes unavailable.
* Clustering - A clustered setup designates one instance as the controller, and other instances as workers. (Often, the controller can also be a worker; that's just not its main job.) A worker can read and write, and communicates writes to the controller, which then distributes updates to the other workers. The term "eventual consistency" is often used with this structure.

Many document databases expect to be clustered from the initial install; understanding that makes a lot of their other decisions make sense.
## Do We Need...
Most of the trade-offs to consider revolve around aspects of ACID. We'll look at the first three; while there may be esoteric applications that do not need durability, I'm not aware of any relational or document databases that do not guarantee it.

### Atomicity?

While some document databases do support transactions, most guarantee statement-level atomicity, not transaction-level atomicity. To think through an example, let's consider removing a patron from our library. We would not want deletion of a patron to succeed if they have any books checked out; but if they have brought them all back and want to close out their account, we want to handle that in one transaction. _(In practice, we would probably inactivate them; but, for now, they're gone.)_

In a relational database, we can do this easily. When deleting a patron, the application can look for the books they have checked out, display a list, and ask "Has the patron returned these books?" If the librarian clicks "yes," the application can start a transaction; delete the `book_checked_out` rows for the patron; delete the patron; then commit the transaction. If that is successful, we know that the books have been returned _and_ the patron has been deleted.
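A minimal sketch of that transactional delete, using Python's `sqlite3` module; the table and column names are hypothetical stand-ins for the library example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patron (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE book_checked_out (book_id INTEGER NOT NULL, patron_id INTEGER NOT NULL);
    INSERT INTO patron VALUES (1234, 'Anderson, Alice');
    INSERT INTO book_checked_out VALUES (342136, 1234);
""")

# "with conn" opens a transaction: both DELETEs commit together,
# or an error rolls the whole unit back
with conn:
    conn.execute("DELETE FROM book_checked_out WHERE patron_id = ?", (1234,))
    conn.execute("DELETE FROM patron WHERE id = ?", (1234,))
```

If either `DELETE` raised an error, neither change would persist - exactly the all-or-nothing behavior described above.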

In our document database, we may not be able to do that. (Some databases do support transactions, but these may come with different options.) Without transactions, we may need to execute more queries, and each one could succeed or fail on its own. Remember that our document example has the checked-out books stored as an array within the `book` document. If the database supports removing items from an array, we can do that with one query; if not, we will need to retrieve the checked-out books, alter each array to exclude the patron, then update each book. Finally, we could execute a query to delete the patron.
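A sketch of that multi-query path, assuming documents stored as JSON text in SQLite with hypothetical table and field names. Note that each statement stands alone, so a failure partway through leaves the data partially changed:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (data TEXT NOT NULL)")
conn.execute("CREATE TABLE patron (data TEXT NOT NULL)")
conn.execute("INSERT INTO book VALUES (?)", (json.dumps(
    {"Id": 342136, "Title": "Little Women",
     "CheckedOut": [{"Id": 45112, "Name": "Anderson, Alice"}]}),))
conn.execute("INSERT INTO patron VALUES (?)",
             (json.dumps({"Id": 45112, "Name": "Anderson, Alice"}),))

patron_id = 45112

# 1. Retrieve the books this patron has checked out
books = [json.loads(d) for (d,) in conn.execute("SELECT data FROM book")]
books = [b for b in books
         if any(c["Id"] == patron_id for c in b.get("CheckedOut", []))]

# 2. Update each book with the patron excluded - one independent query per book
for book in books:
    book["CheckedOut"] = [c for c in book["CheckedOut"] if c["Id"] != patron_id]
    conn.execute("UPDATE book SET data = ? WHERE json_extract(data, '$.Id') = ?",
                 (json.dumps(book), book["Id"]))

# 3. Finally, delete the patron - yet another query that could fail on its own
conn.execute("DELETE FROM patron WHERE json_extract(data, '$.Id') = ?", (patron_id,))
```

Any of steps 1-3 could succeed while a later step fails; without a transaction, nothing unwinds the earlier work.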

A built-in mitigation for some of this comes in the form of the document itself. The more information stored within the document, the lower the risk that multiple queries will be needed. Our patron-deletion example does require multiple queries, but it is a bit contrived as well. For checking in a book, we just need to remove the checkout from the array; in a document database that supports in-place array manipulation, that "transaction" is a single query, just as it would be a single `DELETE` in the relational structure.
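SQLite (with its JSON functions available) can express that single-query check-in; this sketch rebuilds the `CheckedOut` array in place, excluding the returning patron - table and field names are again hypothetical:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (data TEXT NOT NULL)")
conn.execute("INSERT INTO book VALUES (?)", (json.dumps(
    {"Id": 342136, "Title": "Little Women",
     "CheckedOut": [{"Id": 45112, "Name": "Anderson, Alice"},
                    {"Id": 38472, "Name": "Brown, Barry"}]}),))

# One statement: rebuild the array without patron 45112's entry
conn.execute("""
    UPDATE book
       SET data = json_set(data, '$.CheckedOut',
                  (SELECT json_group_array(json(value))
                     FROM json_each(data, '$.CheckedOut')
                    WHERE json_extract(value, '$.Id') <> ?))
     WHERE json_extract(data, '$.Id') = ?""", (45112, 342136))
```

Because it is one statement, it is atomic on its own - no transaction machinery required.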

### Consistency?

No one says "I don't need consistent data - just give me something!" However, consistency guarantees come with a cost. Relational databases must validate all constraints, which means that the developer must specify all constraints. This enforcement can complicate backup and restore, which must be done in a certain order (though the relational workaround is to disable the constraints, load the data, then re-enable the constraints; if re-enabling fails, the backup was bad).

For document databases, consistency is not defined as constraints in the database. This does not mean that the logical constraints don't exist (remember, most data has structure _and_ is related to other data), but it shifts responsibility for maintaining those constraints to the application. For example, this site uses document storage within a SQLite database, a hybrid concept we'll discuss more fully as we move into the libraries we've written to make this easier. The pages are documents, but the revisions are stored in a relational table. When a page is deleted, SQLite makes no attempt to keep its revisions from being orphaned.

The knowledge that the database makes no guarantees can bleed into how effective documents should be designed (also a future topic). Robust applications should treat most relationships as optional, unless the absence of the related data is something the application cannot work around. For example, the software that runs this site also supports blog posts and categories to which those posts can be assigned. The absence of a category should not prevent a post from displaying. The logic to delete categories also removes them from each post's array of category IDs, but there is no enforcement of that at the database level.
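Application-maintained consistency might look something like this sketch - hypothetical table and field names, not the actual code from the software described above:

```python
import json
import sqlite3

def delete_category(conn: sqlite3.Connection, category_id: int) -> None:
    """Delete a category AND scrub its ID from every post's array -
    referential integrity the database itself will not enforce."""
    conn.execute("DELETE FROM category WHERE json_extract(data, '$.Id') = ?",
                 (category_id,))
    for (raw,) in conn.execute("SELECT data FROM post").fetchall():
        post = json.loads(raw)
        if category_id in post.get("CategoryIds", []):
            post["CategoryIds"].remove(category_id)
            conn.execute(
                "UPDATE post SET data = ? WHERE json_extract(data, '$.Id') = ?",
                (json.dumps(post), post["Id"]))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (data TEXT NOT NULL)")
conn.execute("CREATE TABLE post (data TEXT NOT NULL)")
conn.execute("INSERT INTO category VALUES (?)",
             (json.dumps({"Id": 7, "Name": "News"}),))
conn.execute("INSERT INTO post VALUES (?)",
             (json.dumps({"Id": 1, "Title": "Hello", "CategoryIds": [7, 9]}),))

delete_category(conn, 7)
```

If the scrubbing step were forgotten, posts would carry dangling category IDs - which is why the application treats the relationship as optional when rendering.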

One note about "eventual consistency" - in practice, the "eventual" part is rarely an issue. Absent some huge network latency, most eventually-consistent queries are consistent within the first second, and the vast majority within a few more. It's not consistency as a computer scientist would define it, but it's usually consistent enough for many uses.

### Isolation?

The [article linked above][ACID] has a good description of an isolation failure, under the heading of the same name. A relational database with ACID guarantees will usually detect a deadlock and fail one or both of the updates. Document databases can have different ways of handling this scenario, but they usually end up with some form of "last update wins," which can result in both "phantom reads" (where a document matches a query, but its contents no longer match by the time it is retrieved from disk) and lost updates.

That sounds terrible - why doesn't our consideration end here? The main reason is that isolation failures only occur with writes (updates), and they only apply to single documents. If your data is read more than it is written, or written all at once and then read, this is a low-risk issue. If you have one person updating the data, the risk rounds down to non-existent. Even in a multi-user environment, the likelihood of the same document being modified at the exact same time by different users is very, very low.

The concern should not be ignored; it would not be a principle of data integrity if it were not important. As with consistency, some document databases can require isolation on certain commands, and those options should be used; the slight bit of extra time a query takes to complete is likely much less than you would spend unwinding what would probably look like a data "glitch." If the document database does not have a way to ensure isolation, consider application-level mitigations in cases where conflicting updates may occur.
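One common application-level mitigation is optimistic concurrency: read the document, make the change, and write it back only if the stored copy is still the one that was read. A sketch with a hypothetical schema (a real implementation would more likely compare a version field than the whole document):

```python
import json
import sqlite3

def checked_update(conn, book_id, mutate):
    """Apply `mutate` to a book document; return False if another
    writer changed the document between our read and our write."""
    (original,) = conn.execute(
        "SELECT data FROM book WHERE json_extract(data, '$.Id') = ?",
        (book_id,)).fetchone()
    doc = json.loads(original)
    mutate(doc)
    cursor = conn.execute(
        "UPDATE book SET data = ? WHERE json_extract(data, '$.Id') = ? AND data = ?",
        (json.dumps(doc), book_id, original))
    return cursor.rowcount == 1  # 0 rows updated means we lost the race; retry

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (data TEXT NOT NULL)")
conn.execute("INSERT INTO book VALUES (?)",
             (json.dumps({"Id": 342136, "CopiesOnHand": 3}),))

ok = checked_update(conn, 342136,
                    lambda b: b.update(CopiesOnHand=b["CopiesOnHand"] - 1))
```

A `False` return signals a conflicting update; the caller can re-read and try again rather than silently overwriting someone else's change.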

## A Final Consideration

As mentioned above, most document databases are designed with multiple instances in mind. What they do well is make a quick update locally, then communicate that change up to the controller. You won't find things like sequences or automatically-incrementing numeric IDs, because there is no practical way to implement those in a distributed system. If you are using a single instance of a document database, many (but not all!) of the ACID concerns and exceptions go away. If an update requires a "quorum" of servers to report success, but the entire cluster is one combined controller/worker, using things like transactions or isolation (if supported) will have no appreciable performance effect on your application.

---

Understanding the trade-offs in what we lose and gain by sticking to ACID or deviating from it can help guide our decision about which data store to target - and, once that decision is made, we will be writing an application which utilizes it. While the considerations here focused on the database itself, we'll turn to trade-offs in application development in the next section.

[ACID]: https://en.wikipedia.org/wiki/ACID "ACID • Wikipedia"
|
16
concepts/toc.yml
Normal file
@ -0,0 +1,16 @@
- name: A Brief History of Relational Data
  href: a-brief-history-of-relational-data.md
- name: What Are Documents?
  href: what-are-documents.md
- name: Relational / Document Trade-Offs
  href: relational-document-trade-offs.md
- name: Application Trade-Offs
  href: application-trade-offs.md
- name: Hybrid Data Stores
  href: hybrid-data-stores.md
- name: Document Design Considerations
  href: document-design-considerations.md
- name: Appendix
  items:
  - name: Referential Integrity with Documents
    href: referential-integrity.md
118
concepts/what-are-documents.md
Normal file
@ -0,0 +1,118 @@
# What Are Documents?

## Structure Optional

The majority of the [previous page][prev] was dedicated to describing a conceptual structure for our data, and how that is represented in a high-level language with an ORM library. This is not a bad thing on its own; most data has a defined structure. But what happens when that structure changes? Or when we do not know the structure in advance?

This is where the document database can provide benefits. We did not show the SQL to create the tables in the library example, but our `book` table might look something like this in SQLite:

```sql
CREATE TABLE book (
    id             INTEGER NOT NULL PRIMARY KEY,
    title          TEXT    NOT NULL,
    copies_on_hand INTEGER NOT NULL DEFAULT 0);
```

If we wanted to add, for example, the date the library obtained the book, we would have to change the structure of the table...

```sql
ALTER TABLE book ADD COLUMN date_obtained DATE;
```

Document databases do not require anything like this. For example, creating a `book` collection in MongoDB, using their JavaScript API, is...

```javascript
db.createCollection('book')
```

The only structural requirement is that each document have some field that can serve as its identifier within the collection. MongoDB uses `_id` for this purpose; other document stores allow the ID field to be configured per collection.

## Mapping the Entities

In our library, we had books, authors, and patrons as entities. In an equivalent document database setup, we would likely still have separate collections for each. A `book` document might look something like...

```json
{
    "Id": 342136,
    "Title": "Little Women",
    "CopiesOnHand": 3
}
```

Because no assumptions are made about structure, if we began adding books with a `DateObtained` field, the database would simply add it, no questions asked.

```json
{
    "Id": 452343,
    "Title": "The Hunt for Red October",
    "DateObtained": "1986-10-20",
    "CopiesOnHand": 1
}
```

The only field the database cares about is `Id`, assuming we specified that as our collection's ID field.

## Mapping the Relations

We certainly could bring `book_author` and `book_checked_out` across as documents in their own collections. However, document databases do not (generally) have the concept of foreign keys.

Let's first tackle the book/author relationship. JSON has an array type, which allows multiple entries of the same type. We can add an `Authors` property to our `book` document:

```json
{
    "Id": 342136,
    "Title": "Little Women",
    "Authors": [55923],
    "CopiesOnHand": 3
}
```

With this structure, if we're rendering search results and want to display the author's name(s) next to the title, we will either need to query the `author` collection for each ID in our `Authors` array, or come up with a projection that crosses two collections. Since we're still storing properties of a `book`, though, we could include the author's name.
```json
{
    "Id": 342136,
    "Title": "Little Women",
    "Authors": [{
        "Id": 55923,
        "Name": "Alcott, Louisa May"
    }],
    "CopiesOnHand": 3
}
```

This document does a lot for us; we can now see the title and the authors all together, and having the IDs would allow us to dig into the data further. If we were writing a Single-Page Application (SPA), this could be used without any transformation at all.
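For comparison, had we stored only the author IDs (the first `Authors` structure above), rendering names would mean one extra query per ID - a sketch, assuming JSON documents stored as text in SQLite with hypothetical table names:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (data TEXT NOT NULL)")
conn.execute("INSERT INTO author VALUES (?)",
             (json.dumps({"Id": 55923, "Name": "Alcott, Louisa May"}),))

book = {"Id": 342136, "Title": "Little Women", "Authors": [55923]}

# One lookup per ID in the Authors array - N extra queries for N authors
names = [conn.execute(
             "SELECT json_extract(data, '$.Name') FROM author "
             "WHERE json_extract(data, '$.Id') = ?", (author_id,)).fetchone()[0]
         for author_id in book["Authors"]]
```

Embedding the names in the `book` document trades this per-author lookup for some repeated data.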

Conversely, any application code would have to be aware of this structure. Our C# code from the last page would now likely need a `DisplayAuthor` type, with `Authors` becoming an `ICollection<DisplayAuthor>`. We also see our first instance of repeated data. The next page is a deeper discussion of the trade-offs we should consider.

For now, though, we still need to represent the checked-out books. We can use a similar technique as we did for authors, including the return date.

```json
{
    "Id": 342136,
    "Title": "Little Women",
    "Authors": [{
        "Id": 55923,
        "Name": "Alcott, Louisa May"
    }],
    "CopiesOnHand": 3,
    "CheckedOut": [{
        "Id": 45112,
        "Name": "Anderson, Alice",
        "ReturnDate": "2025-04-02"
    }, {
        "Id": 38472,
        "Name": "Brown, Barry",
        "ReturnDate": "2025-03-27"
    }]
}
```

## Structure Reconsidered

One of the big marketing points for document databases is their ability to handle "unstructured data." I won't go as far as saying such data does not exist, but the _vast_ majority of data described this way is data whose structure is simply unknown to the person considering doing something with it. The data itself has structure; they just do not know what it is when they get started - usually a prerequisite for creating the data store. On rare occasions, there may be data sets with several structures mixed together; even then, the cacophony usually turns out to be a finite set of structures, mixed inconsistently.

Keep that in mind as we look at some of the trade-offs between document and relational databases. Just as your body needs its skeletal structure against which your muscles and organs can work, your data _has_ structure. Document databases do not abstract that away.

[prev]: ./a-brief-history-of-relational-data.md "A Brief History of Relational Data • Relational Documents • Bit Badger Solutions"
4
doc-template/public/main.css
Normal file
@ -0,0 +1,4 @@
article h2 {
    border-bottom: solid 1px gray;
    margin-bottom: 1rem;
}
39
docfx.json
Normal file
@ -0,0 +1,39 @@
{
  "$schema": "https://raw.githubusercontent.com/dotnet/docfx/main/schemas/docfx.schema.json",
  "build": {
    "content": [
      {
        "files": [ "**/*.{md,yml}" ],
        "exclude": [ "_site/**" ]
      }
    ],
    "resource": [
      {
        "files": [ "images/**", "bitbadger-doc.png", "favicon.ico" ]
      }
    ],
    "output": "_site",
    "template": [ "default", "modern", "doc-template" ],
    "globalMetadata": {
      "_appName": "Relational Documents",
      "_appTitle": "Relational Documents",
      "_appLogoPath": "bitbadger-doc.png",
      "_appFaviconPath": "favicon.ico",
      "_appFooter": "Hand-crafted documentation created with <a href=https://dotnet.github.io/docfx target=_blank class=external>docfx</a> by <a href=https://bitbadger.solutions target=_blank class=external>Bit Badger Solutions</a>",
      "_enableSearch": true,
      "pdf": false
    }
  }
}
BIN
favicon.ico
Normal file
Binary file not shown.
After Width: | Height: | Size: 9.3 KiB |
55
index.md
Normal file
@ -0,0 +1,55 @@
---
_layout: landing
---

# Using Relational Databases as Document Stores

Bit Badger Solutions has developed, and continues to maintain, libraries that provide a document store interface over PostgreSQL and SQLite. If that sounds awesome, jump right in! If you want to explore the concept more fully, the second section has you covered. Either way, welcome!

## Code

These libraries provide a convenient <abbr title="Application Programming Interface">API</abbr> to treat PostgreSQL or SQLite as document stores.

**BitBadger.Documents** ~ [Documentation][docs-dox] ~ [Git][docs-git]<br>
Use for .NET applications (C#, F#)

**PDODocument** ~ [Documentation][pdoc-dox] ~ [Git][pdoc-git]<br>
Use for PHP applications (8.2+)

**solutions.bitbadger.documents** ~ Documentation _(soon)_ ~ Git _(soon)_<br>
Use for <abbr title="Java Virtual Machine">JVM</abbr> applications (Java, Kotlin, Groovy, Scala)

## Concepts

When we use the term "documents" in the context of databases, we are referring to a database that stores its entries in a structured data format (usually a form of JavaScript Object Notation, or JSON). Unlike relational databases, document databases tend to have a relaxed schema; often, document collections or tables are the only definition required - and some even create those on the fly the first time one is accessed!

> [!NOTE]
> This content was originally hosted on the [Bit Badger Solutions][] main site; references to "the software that runs this site" refer to [myWebLog][], an application which uses the .NET version of this library to store its data in a hybrid relational / document format.

**[A Brief History of Relational Data][hist]**<br>Before we dig in on documents, we'll take a look at some relational database concepts

**[What Are Documents?][what]**<br>How documents can represent flexible data structures

**[Relational / Document Trade-Offs][trade]**<br>Considering the practical pros and cons of different data storage paradigms

**[Application Trade-Offs][app]**<br>Options for applications utilizing relational or document data

**[Hybrid Data Stores][hybrid]**<br>Combining document and relational data paradigms

**[Document Design Considerations][design]**<br>How to design documents based on intended use


[docs-dox]: https://bitbadger.solutions/open-source/relational-documents/dotnet/ "BitBadger.Documents • Bit Badger Solutions"
[docs-git]: https://git.bitbadger.solutions/bit-badger/BitBadger.Documents "BitBadger.Documents • Bit Badger Solutions Git"
[pdoc-dox]: https://bitbadger.solutions/open-source/relational-documents/php/ "PDODocument • Bit Badger Solutions"
[pdoc-git]: https://git.bitbadger.solutions/bit-badger/pdo-document "PDODocument • Bit Badger Solutions Git"
[jvm-dox]: ./jvm/ "solutions.bitbadger.documents • Bit Badger Solutions"
[jvm-git]: https://git.bitbadger.solutions/bit-badger/solutions.bitbadger.documents "solutions.bitbadger.documents • Bit Badger Solutions Git"
[Bit Badger Solutions]: https://bitbadger.solutions "Bit Badger Solutions"
[myWebLog]: https://bitbadger.solutions/open-source/myweblog/ "myWebLog • Bit Badger Solutions"
[hist]: ./concepts/a-brief-history-of-relational-data.md "A Brief History of Relational Data • Bit Badger Solutions"
[what]: ./concepts/what-are-documents.md "What Are Documents? • Bit Badger Solutions"
[trade]: ./concepts/relational-document-trade-offs.md "Relational / Document Trade-Offs • Bit Badger Solutions"
[app]: ./concepts/application-trade-offs.md "Application Trade-Offs • Bit Badger Solutions"
[hybrid]: ./concepts/hybrid-data-stores.md "Hybrid Data Stores • Bit Badger Solutions"
[design]: ./concepts/document-design-considerations.md "Document Design Considerations • Bit Badger Solutions"