Initial site development #1
1
.gitignore
vendored
Normal file
@@ -0,0 +1 @@
_site
13
.idea/.gitignore
generated
vendored
Normal file
@@ -0,0 +1,13 @@
# Default ignored files
/shelf/
/workspace.xml
# Rider ignored files
/.idea.relational-documents.iml
/modules.xml
/projectSettingsUpdater.xml
/contentModel.xml
# Editor-based HTTP Client requests
/httpRequests/
# Datasource local storage ignored files
/dataSources/
/dataSources.local.xml
8
.idea/indexLayout.xml
generated
Normal file
@@ -0,0 +1,8 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="UserContentModel">
    <attachedFolders />
    <explicitIncludes />
    <explicitExcludes />
  </component>
</project>
6
.idea/vcs.xml
generated
Normal file
@@ -0,0 +1,6 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="VcsDirectoryMappings">
    <mapping directory="" vcs="Git" />
  </component>
</project>
BIN
bitbadger-doc.png
Normal file
Binary file not shown (size: 12 KiB).
97
concepts/a-brief-history-of-relational-data.md
Normal file
@@ -0,0 +1,97 @@
# A Brief History of Relational Data

## Relational Data

Relational databases were not the first data structure, but when people talk about document databases, their behavior is usually contrasted with relational databases. We do not need a PhD-level knowledge of these databases, but some high-level concepts will serve us well.

The "relation" in "relational database" names the concept that instances of data can be linked (related) to other pieces of data. Think of a system that keeps track of books for a library; some items they would need to track are the book's title, its author, how many copies they have, and who has a copy checked out. If they were to make a new "book" entry for each physical copy, they would end up repeating nearly everything. With a relational database, though, we could structure tables where very little would be duplicated.

- An `author` table could hold the author's name, plus biographical information, etc.
- A `patron` table could hold the library cardholder's information
- The `book` table could have the name of the book and how many copies the library owns
- A `book_author` table has the ID of a book and the ID of an author
- A `book_checked_out` table could have the ID of a book, the ID of a patron, and the return date

In this example, we have 5 tables to hold the information, and two of those are there solely for the purpose of associating entities with each other. When we think about this structure, there are some interesting ways of parsing the data that weren't covered by the description above.

- Books can have multiple authors; this structure also provides an easy way to find books that an author has written.
- We can count the occurrences of a book in the `book_checked_out` table and subtract that from the copies we own to determine how many copies are available for check out.
- We can easily track which books a single patron has checked out.
- If an author's name changes, we update the `author` table once, and the system picks up the new name everywhere.

Notice the word "could" in the descriptions of the tables; there are different ways to define relations among entities, and database purists could also come up with scenarios that this structure does not cover. The intent here is to present an isolated yet non-trivial working example that we can use as we think through how this data is structured.
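
As a rough illustration of the availability calculation described above, the following C# sketch counts the checkouts for each book and subtracts them from the copies owned. The in-memory tuples stand in for the `book` and `book_checked_out` tables; their field names and the sample rows are assumptions made for this example, not a real schema.

```csharp
using System;
using System.Linq;

// In-memory stand-ins for the book and book_checked_out tables (hypothetical sample data)
var books = new[]
{
    (Id: 1L, Title: "Book One", CopiesOnHand: 3),
    (Id: 2L, Title: "Book Two", CopiesOnHand: 1)
};
var checkedOut = new[]
{
    (BookId: 1L, PatronId: 10L),
    (BookId: 1L, PatronId: 11L)
};

// Copies available = copies owned minus the book's rows in book_checked_out
var availability = books
    .Select(b => (b.Title, Available: b.CopiesOnHand - checkedOut.Count(c => c.BookId == b.Id)))
    .ToList();

foreach (var (title, available) in availability)
    Console.WriteLine($"{title}: {available} available");
```

In a relational database, the same idea would typically be expressed as a join with a count; the sketch only shows that nothing needs to be stored twice to answer the question.
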

## The ORM Bridge

In high-level programming languages, developers create structures with data representing some entity. Most database drivers deal in tables, rows, and columns. If we were to use one of those drivers directly, we would end up writing several queries and lots of property setting to constitute a domain object from the relational data in our tables.

An Object-Relational Mapper (ORM) library helps to translate between these structures and the relational database. There are lots of these tools, and they have some pretty solid up-sides in most cases. Take, for example, C# objects and Microsoft's Entity Framework Core (EF Core).

```csharp
// This is the definition of a book
public class Book
{
    public long Id {get; set;} = 0L;
    public string Title {get; set;} = "";
    public int CopiesOnHand {get; set;} = 0;
}

// This is how we retrieve a book
// - ctx is an open database context (connection)
// - theId is the variable with the ID of the book we want to retrieve
// - FirstOrDefaultAsync comes from the Microsoft.EntityFrameworkCore namespace
var book = await ctx.Books.FirstOrDefaultAsync(b => b.Id == theId);
```

This illustrates the simplicity of code using an ORM library. Rather than creating a query, executing it, checking to see if anything was returned, then assigning each column from the result to a property in the class, we write the single line above. (If no `Book` for that ID exists, the `book` variable will be `null`.)

That's great for the book information, but that's not all we need to access; we need authors for the book and we need patrons to check them out. We're not trying to become EF Core experts, but adding this information looks something like this...

```csharp
public class Author
{
    public long Id {get; set;} = 0L;
    public string Name {get; set;} = ""; // naive
    // dates of birth/death, bio, etc.
    // A "navigation property" to find books by author
    public ICollection<Book> Books {get; init;} = new List<Book>();
}

public class Patron
{
    public long Id {get; set;} = 0L;
    public string Name {get; set;} = "";
    public string Phone {get; set;} = "";
    // skipping the navigation property here
}

public class CheckOut
{
    public long BookId {get; set;} = 0L;
    public long PatronId {get; set;} = 0L;
    public DateTime ReturnDate {get; set;} = DateTime.Now;
    // "Navigation properties"
    public Book Book {get; set;} = default!;
    public Patron Patron {get; set;} = default!;
}

// A new Book class
public class Book
{
    // properties as before, then...
    // ...more navigation properties
    public ICollection<Author> Authors {get; init;} = new List<Author>();
    public ICollection<CheckOut> CheckOuts {get; init;} = new List<CheckOut>();
}
```

Notice that the `Author` has a collection of `Book`s, and a `Book` has a collection of `Author`s. This is how the `book_author` table is represented. For checked-out books, we actually have a type that links a `Book` to a `Patron` and also stores the return date. EF Core's "navigation properties" are how it exposes these other entities within .NET code. If we do that same simple `Book` retrieval command from the first example, we can now traverse these properties to display the author's name, who has books checked out, and when they're due back.

## Referential Integrity

The term "referential integrity" is used to describe how these relations are kept in sync. In our example above, we wouldn't want a `book_author` record pointing to either a book or an author that does not exist. We would not want to allow a patron to be deleted while they still had books checked out. (And, we may not want to allow them to be deleted at all! If we need to keep records of who checked out what book, when, and when they were returned, we would lose those records if we deleted the patron.)

This is an area where relational databases excel. The IDs we have in some tables that point to tables where rows have that ID are called foreign keys. Relational databases allow us to define foreign keys that must exist; ones that may be missing, but must point to an existing row when present; and what should happen when the parent record is deleted. To accomplish this, indexes will be applied to these key fields; this not only lets the integrity checks happen quickly, but the same indexes can also be used by queries that join the information efficiently.

## So... What's the Problem?

There are no problems, only trade-offs. Let's look at how we get to documents from here.
70
concepts/application-trade-offs.md
Normal file
@@ -0,0 +1,70 @@
# Application Trade-Offs

## Working with Domain Objects

### ORMs

When we first started, we mentioned Object-Relational Mapper (ORM) tools. They keep developers from having to write a lot of boilerplate code for every object and data item the application uses. Many of them provide the ability to track changes to the objects they return, allowing the application to retrieve the object, update some properties, then tell the database to save changes; the tool determines the SQL statements needed to persist that change. They can handle changes, additions, and deletions for many objects in one request.

ORMs can also be safer. [SQL injection attacks][inject], as part of injection attacks in general, were ranked by the [<abbr title="Open Web Application Security Project">OWASP</abbr> Top 10][owasp] at #1 in 2017 and #3 in 2021. An ORM tool will use parameters to safely pass field content in queries. ORMs make the right thing to do the easy thing; parameterization is their default behavior, and it has to be bypassed (in part or in whole) to write a vulnerable query.

The downside is that certain data structures do not translate well to relational databases. Consider a blog post with tags. In a relational database, the proper way to store these would be in a `post_tag` table, where the post's ID is repeated for each tag. Retrieving a post and its tags requires multiple queries (or a nested subquery which could translate tags to a comma-delimited string). The ORM must translate this when the object is retrieved, and must decompose the domain object into its rows when updating data. _(Some relational databases, most notably PostgreSQL, do have an `ARRAY` column type; in this case, that would simplify this scenario.)_

### Serialization

In general, "serialization" refers to the process by which a domain object is converted to a program-independent format; objects can be recreated by "deserializing" the output from a former serialization. (This used to be mostly text formats - plain text, <abbr title="Extensible Markup Language">XML</abbr>, <abbr title="JavaScript Object Notation">JSON</abbr> - but can also be done with common binary formats.) The rise of the Single Page Application (SPA), where data is exchanged using JSON, means that the majority of serialization is happening to and from JSON. As document databases store their documents in JSON form (or a binary version of it), we can build on this to get domain objects into and out of a document database.

If an application has JSON serialization strategies already defined for use in a SPA, these same strategies can (generally) be used with a document database. In some scenarios, this may completely eliminate a deserialization/serialization sequence (from the database to the application, then from the application to the browser); select the JSON from the database - and, for multiple results, combine the rows with `,` wrapped with `[` and `]`. Now, the server simply passes text along, rather than getting text, making an object, then turning that object back into text.
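
The "combine the rows" step might look something like the sketch below. The `rows` list is a hypothetical stand-in for JSON text already selected from the database; the parse at the end exists only to check that the combined text is a valid JSON array, and would not be needed when the text is simply passed through to the browser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical rows, each one a JSON document returned as text by the database
var rows = new List<string>
{
    "{\"id\":1,\"title\":\"First Post\"}",
    "{\"id\":2,\"title\":\"Second Post\"}"
};

// Join the rows with "," and wrap them with "[" and "]"; no domain objects are created
var json = "[" + string.Join(",", rows) + "]";

// Sanity check only: confirm the combined text parses as a JSON array
using var doc = JsonDocument.Parse(json);
Console.WriteLine($"{doc.RootElement.GetArrayLength()} documents in the array");
```
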

One big downside, when contrasted with ORMs, is that document database drivers are likely not going to have the "update the object and save changes" paradigm of those tools. Document databases generally do not support row locking and other features ORMs use to ensure that their change-aware objects can be persisted.

Another difference is that an "update," in document database terms, usually refers to replacing an entire document. The only property which cannot be changed is the ID; any other property can be added, updated, or removed by an update. A partial update of a document is called a "patch." A patch specifies the JSON which should be present after the update, and search criteria to identify documents to be patched. (Patching by ID is fine.)
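
To make the distinction concrete, here is a minimal sketch that models a document as a dictionary. The `Update` and `Patch` functions are hypothetical helpers written for this illustration (they are not any database's API), and the field names are assumptions.

```csharp
using System;
using System.Collections.Generic;

// A stored document, modeled as a dictionary (hypothetical fields and values)
var stored = new Dictionary<string, object>
{
    ["id"] = 7, ["name"] = "Pat Johnsen", ["phone"] = "555-0100"
};

// "Update": the new document replaces the old one entirely; only the ID carries over
Dictionary<string, object> Update(Dictionary<string, object> current, Dictionary<string, object> replacement)
{
    return new Dictionary<string, object>(replacement) { ["id"] = current["id"] };
}

// "Patch": only the fields named in the patch change; every other field is kept
Dictionary<string, object> Patch(Dictionary<string, object> current, Dictionary<string, object> changes)
{
    var result = new Dictionary<string, object>(current);
    foreach (var (key, value) in changes) result[key] = value;
    return result;
}

var newName = new Dictionary<string, object> { ["name"] = "Pat Janssen" };
var updated = Update(stored, newName); // has "id" and "name"; "phone" is gone
var patched = Patch(stored, newName);  // has "id", "name", and "phone"

Console.WriteLine($"updated fields: {updated.Count}, patched fields: {patched.Count}");
```

Sending only the new name as an "update" silently drops the phone number, which is why the two terms are worth keeping straight.
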

> This is where some consistency risk increases. There is no currency check on what the value was when it was obtained, and even if some other process has patched it, the patch command will not be rejected. While patches that use commands like "increment" or "append" (provided by some document databases) will succeed as expected, others may not. Imagine a library patron whose last name was entered as `Johnsen`. The patron goes to the check-out desk to have it updated to `Janssen` - but someone in the back was doing a quality check on new patrons, and decided to "fix the typo" by correcting it to `Johnson`. If both these updates happened after users had retrieved `Johnsen`'s record, the last one to press "Save" would overwrite the previous one.

### Overfetching

Overfetching is retrieving more data from the database than is necessary for the process being conducted. It gets its own subheading because it is a problem common to both ORMs and document databases (although, in theory, the larger a document becomes, the more the problem is exacerbated). Depending on how much extra data is being returned, it may not be an issue in practice. In most cases, though, considering overfetching should likely be postponed until the process works, but no later. When your little hobby site catches fire, what used to run great can be brought to its knees retrieving data and populating objects for no reason at all.

Consider the admin area for a blog, where a list of posts is displayed. The list probably needs the title, author, status, and published date, but little else. We do not need categories, tags, comments, or even the content of the post.

Both ORMs and document databases usually provide "projections" (even if they use a different term for it). A projection is most simply understood as a different view of the data. For an ORM, perhaps we create a `post_list` view, and a domain object for those items; we can now query that view and only select the data we need. Some document databases offer a form of "without" statement that will exclude items from the document. (The application needs to handle these being missing.) Documents may be able to be patched in the result, leaving the database itself unchanged. (The query, paraphrased, is "give me the `post` document patched with the JSON object `{"tags": [], "text": ""}`." The document returned would have an empty list of tags and no text.)
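
The shape of a projection can be sketched with LINQ. The tuples below are in-memory stand-ins with assumed field names, so nothing is actually saved here; with an ORM such as EF Core, a comparable `Select` over a database-backed query is generally translated into SQL that only requests the listed columns.

```csharp
using System;
using System.Linq;

// Full posts, including the large fields the list screen does not need (hypothetical data)
var posts = new[]
{
    (Id: 1, Title: "Hello", Author: "admin", Status: "Published",
     Tags: new[] { "intro", "meta" }, Text: "A very long post body...")
};

// Projection: a narrower shape holding only what the list of posts displays
var postList = posts
    .Select(p => (p.Id, p.Title, p.Author, p.Status))
    .ToList();

foreach (var item in postList)
    Console.WriteLine($"{item.Id}: {item.Title} by {item.Author} ({item.Status})");
```
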

## Working with Data

Documents do not have to be turned into domain objects. Perhaps you are using a language like PHP, where data structures are commonly returned as associative arrays (similar to a dictionary or map in other languages). Or, maybe you just need one or two items from a table or document without retrieving the entire thing.

### Addressing Fields

Selecting a few fields in SQL is trivial; some ORMs make it easy to get to the underlying connection, and nearly all relational database drivers have the concept of a result set, where data items for the current row can be accessed by name, their index in the `SELECT` statement, or both.

Document databases usually provide the ability to retrieve arbitrary fields from documents, but their implementations can vary. [MongoDB][] allows you to specify `1` for fields to include (and `0` to exclude the `_id` field, which is returned otherwise). [RethinkDB][] provides the `.pluck` command to select certain fields and the `.without` command to exclude certain fields.

For documents stored in relational databases, there is syntax for selecting fields from documents similar to how columns from a table are selected. They are addressed the same way as columns which come from the table itself. As with document databases, though, the syntax varies, and may be implemented as custom operators or function calls.

### Indexing

We have yet to discuss indexes to any great extent. They can bring huge performance boosts in either data paradigm. While we'll consider them fully when we dig into document design, a short consideration here will serve us well. Both relational and document databases use a unique index on the primary key column(s) or field; we'll look at others we may need.

In a relational database, foreign key fields should be indexed. These databases maintain integrity by checking values as described in the constraint, and they do this every time an `INSERT` or `UPDATE` is executed which sets a new value. If the foreign key is not indexed, the database has to search every row of the table manually (a "full table scan"). These indexes help our application as well; for our patron / checked-out book association, the index will help us identify these rows quickly, whether we are starting from the patron or the book.

Other indexes can be created for fields commonly found in a `WHERE` clause. If we built a "find a patron by their e-mail address" process into our library system, we would likely want to index the `email` field we would add to their record. (Doubly so if they can also use that e-mail address to sign in to the library system to access resources there.)
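
As a loose analogy only, the difference between a full scan and an indexed lookup can be shown with in-memory collections. Real database indexes are usually tree structures maintained by the database itself rather than a dictionary, and the patron fields below are assumptions for this example.

```csharp
using System;
using System.Linq;

// Hypothetical patron rows
var patrons = new[]
{
    (Id: 1L, Name: "Pat", Email: "pat@example.com"),
    (Id: 2L, Name: "Sam", Email: "sam@example.com")
};

// Without an index: examine rows one at a time until a match is found (a "full table scan")
var scanned = patrons.First(p => p.Email == "sam@example.com");

// With an index: build the lookup structure once, then jump straight to the match
var byEmail = patrons.ToDictionary(p => p.Email);
var indexed = byEmail["sam@example.com"];

Console.WriteLine($"scan found {scanned.Name}; index found {indexed.Name}");
```

The dictionary also has to be kept up to date whenever a patron changes, which mirrors the maintenance cost that database indexes carry.
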

Relational indexes are not free; they take up some space, and the database's ACID guarantees apply to indexes as well. This can slow down updates, and the more indexes that need to be updated, the greater the slowdown. A great starting point for indexes is primary keys (which the database does for you), foreign keys, and commonly searched items.

Document database indexing is one area where vendors can distinguish their product from others. The shape of the data (arrays, sub-documents, etc.) also requires more indexing options. Most allow creation of an index on an array which can be used to mimic a SQL `IN` query. Some allow indexing computed values which are actually stored in the index, and can be retrieved from there. Some also allow for all indexed values to be also stored in the index; in these cases, queries that only require those fields do not have to retrieve the actual document, as the index can satisfy the query.

Document indexes may not be ACID-compliant, particularly with regard to consistency. In some cases, an index can be explicitly _not_ updated with the command that updates the data; it's executed in the background once the database system has said it's done. In other cases, the application can specifically request to wait until an index is consistent.

## Interim Summary

We have looked at [relational databases][one], [document databases][two], [trade-offs between their data stores][three], and now trade-offs from an application perspective. We have looked at both the strengths and weaknesses of each data model. What if we could get the benefits of relational data _and_ documents at the same time?


[inject]: https://en.wikipedia.org/wiki/SQL_injection "SQL injection • Wikipedia"
[owasp]: https://owasp.org/www-project-top-ten/ "OWASP Top 10"
[MongoDB]: https://www.mongodb.com/ "MongoDB"
[RethinkDB]: https://rethinkdb.com/ "RethinkDB"
[one]: ./a-brief-history-of-relational-data.md "A Brief History of Relational Data • Bit Badger Solutions"
[two]: ./what-are-documents.md "What Are Documents? • Bit Badger Solutions"
[three]: ./relational-document-trade-offs.md "Relational / Document Trade-Offs • Bit Badger Solutions"
75
concepts/relational-document-trade-offs.md
Normal file
@@ -0,0 +1,75 @@
# Relational / Document Trade-Offs

> There are no solutions. There are only trade-offs.<br>_<small>— Thomas Sowell</small>_

While the context of this quote is economics, it is a concept that has many applications, including this topic. There are generally accepted principles of data storage, proved in enterprise applications representing billions in commerce annually. The site you're reading is written by one person, whose business has occasionally crossed the threshold to profitable (but it's been a while). Do we think this site has the same data storage needs as a Fortune 50 enterprise?

Some would say yes. To get to this page, you have likely clicked links that needed to point to pages that actually exist. The software running the site needs to know who I am, and record me as the author of this (and every other) page. It has content, and that content is related. Every time I save a page edit, the software records a revision; each revision needs to be tied to the right page, and if a page goes away, all its revisions should as well.

Most people, though, would probably say no. _Of course_ I do not need large, distributed data centers with dozens of employees supporting my data storage needs. Even if I structure my database poorly, leftover revisions from a deleted page are likely not going to cause a blip in performance, much less fill up a disk. If I do something to mess up a database, in the worst case, I can drop back to the previous night's backup.

"OK, when did this become about the author?", you may be thinking. It isn't _(though if you would like to help make this profitable, reach out!)_; it's an illustration that, while the principles are good - and I'm about to defend them - they are not the only way. By understanding the principles, and the trade-offs, you may be able to reduce complexity in your application.

## The ACID Test

Relational databases, as a general rule, are [ACID][]-compliant. This set of principles (summarized) means that:

* Transactions are treated as a single unit, whether they are a single statement or multiple statements; it all works, or it all fails (an "atomic" transaction, atomicity)
* A transaction cannot leave the data store in an inconsistent state; all constraints must be satisfied (consistency)
* Concurrent transactions cannot see other in-progress transactions (isolation)
* Transactions reported as successful will still be there, even if the server goes down, is interrupted, etc. (durability)

These principles were a part of the data structure we designed on the first page. The links between the `author` and `book`, and `patron` and `book`, fall under consistency; if we tried to check a book out to patron 1234, and that patron did not exist, the transaction would fail. If two librarians are checking out two different books to two different patrons at the same instant, there should be no problem (isolation). However, if they are trying to check out the last copy of the same book - well, at that point, we must decide how to handle it; absent handling strategies, the second attempt will fail (isolation, consistency).

## Distributed Data

Even with advances in CPU and storage, there are limits to what one database server can do. "Edge computing," pushing content as close to its consumer as possible, is easy to do with static files, but can be more challenging for data - especially if ACID data is required. There are several strategies, and their complexities are well beyond our scope here; we'll summarize a few, because they will help with our consideration.

* Sharding - Data is physically placed in different locations (shards) based on the value of a field. "Region," "year," and "first letter of last name" are all valid sharding strategies.
* Replicas - The database, in its entirety, is replicated to other locations. Read-only processes can look at these replicas, rather than the main database, if no updates are required. This reduces the load on the main database, and a replica can be promoted to main if the main becomes unavailable.
* Clustering - A clustered setup designates one instance as the controller, and other instances as workers. (Often, the controller can also be a worker; that's just not its main job.) A worker can read and write, and communicates writes to the controller, which then distributes updates to the other workers. The term "eventual consistency" is often used with this structure.

Many document databases expect to be clustered from initial install; understanding that makes a lot of their other decisions make sense.

## Do We Need...

Most of the trade-offs to consider revolve around needs concerning aspects of ACID. We'll look at the first three; while there may be esoteric applications that do not need durability, I'm not aware of any relational or document databases that do not guarantee that.

### Atomicity?

While some document databases do support transactions, most guarantee statement-level atomicity, not transaction-level atomicity. As an example, let's think through removing a patron from our library. We would not want deletion of a patron to succeed if they have any books checked out, but if they have brought them back and want to close out their account, we want to handle that in one transaction. _(In practice, we would probably inactivate them; but, for now, they're gone.)_

In a relational database, we can do this easily. When deleting a patron, the application can look for the books they have checked out, display a list, and ask "Has the patron returned these books?" If the librarian clicks "yes", the application can start a transaction; delete the `book_checked_out` rows for the patron; delete the patron; then commit the transaction. If that is successful, we know that the books have been returned _and_ the patron has been deleted.

In our document database, we may not be able to do that. (Some databases do support transactions, but these may have different options.) Without transactions, we may need to execute more queries, and each one could succeed or fail on its own. Remember our document example has the checked-out books stored as an array within the `book` document. If the database supports removing items from an array, we can do that with one query; if not, we will need to retrieve the checked-out books, alter each array to exclude the patron, then update each book. Finally, we could execute a query to delete the patron.

A built-in mitigation for some of this comes in the form of the document itself. The more information stored within the document, the lower the risk that multiple queries will be needed. In our patron-removal example, we do need multiple queries, but that example is a bit contrived as well. For checking in a book, we just need to remove the checkout from the array. In a document database that does support in-place array manipulation, the transaction is a single query, just as it would be a single `DELETE` in the relational structure.
|
||||||
|
|
||||||
|
### Consistency?
|
||||||
|
|
||||||
|
No one says "I don't need consistent data - just give me something!" However, consistency guarantees come with a cost. Relational databases must validate all constraints, which means that the developer must specify all constraints. This constraint enforcement can complicate backup and restore, which must be done in a certain order (though the relational workaround is to disable the constraints, load the data, then enable the constraints; if they fail, the backup was bad).
|
||||||
|
|
||||||
|
For document databases, consistency is not defined as constraints in the database. This does not mean that the logical constraints don't exist (remember, most data has structure _and_ is related to other data), but it shifts responsibility for maintaining those constraints to the application. For example, this site uses document storage within a SQLite database, a hybrid concept we'll discuss more fully as we move into the libraries we've written to make this easier. The pages are documents, but the revisions are stored in a relational table. When a page is deleted, SQLite makes no attempt to keep its revisions from being orphaned.
|
||||||
|
|
||||||
|
The knowledge that the database makes no guarantees can bleed into how effective documents should be designed (also a future topic). Robust applications should treat most relationships as optional, unless its absence is one the application cannot work around. For example, the software that runs this site also supports blog posts and categories under which those posts can be assigned. The absence of a category should not prevent a post from displaying. The logic to delete categories also removes them from the array of IDs for a post, but there is no enforcement for that at the database level.
|
||||||
|
|
||||||
|
One note about "eventual consistency" - in practice, the "eventual" part is rarely an issue. Absent some huge network latency, most eventual consistency queries are consistent within the first second, and the vast majority are consistent within a few more. It's not consistent as a computer scientist would define it, but it's usually consistent enough for many uses.
|
||||||
|
|
||||||
|
### Isolation?
|
||||||
|
|
||||||
|
The [article linked above][ACID] has a good description of an isolation failure, under the heading with the same name. A relational database with ACID guarantees will usually throw a deadlock condition on one or both of the updates. Document databases can have different ways of handling this scenario, but they usually end up with some form of "last update wins," which can result in both "phantom reads" (where a document matches a query but its contents do not match by the time it is retrieved from disk) and lost updates.
|
||||||
|
|
||||||
|
That sounds terrible - why doesn't our consideration end here? The main reason is that isolation failures only occur with writes (updates), and they only apply to single documents. If your data is read more than written, or written-all-at-once then read, this is a low-risk issue. If you have one person updating the data, the risk rounds down to non-existent. Even in a multi-user environment, the likelihood of the same document being modified at the exact same time by different users is very, very low.
|
||||||
|
|
||||||
|
The concern should not be ignored; it would not be a principle of data integrity if it were not important. As with consistency, some document databases have the ability to require isolation on certain commands, and they should be used; the slight bit of extra time it will take the query to complete is likely much less than you would spend unwinding what would probably look like a data "glitch." If the document database does not have a way to ensure isolation, consider application-level mitigations in cases where conflicting updates may occur.
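One common application-level mitigation is optimistic concurrency: each document carries a version number, and a save is rejected if the stored version has changed since the document was read. The sketch below uses an in-memory `Map` as a stand-in for a collection; the `version` field and the `read` / `save` helpers are assumptions for illustration, not the API of any particular database.

```javascript
// Hypothetical in-memory store; each document carries a version counter.
const store = new Map([[342136, { id: 342136, copiesOnHand: 3, version: 1 }]]);

// Returns a copy, so each caller works on its own snapshot of the document.
function read(id) {
  return { ...store.get(id) };
}

// Saves only if nobody else has saved since this snapshot was read.
function save(doc) {
  const current = store.get(doc.id);
  if (current.version !== doc.version) {
    return false; // conflict: the caller should re-read and retry
  }
  store.set(doc.id, { ...doc, version: doc.version + 1 });
  return true;
}

// Two users read the same version, then both try to write.
const alice = read(342136);
const barry = read(342136);
alice.copiesOnHand -= 1;
barry.copiesOnHand -= 1;

const aliceSaved = save(alice); // accepted; the stored version becomes 2
const barrySaved = save(barry); // rejected instead of silently overwriting

// Barry retries against the fresh document, so neither update is lost.
const retry = read(342136);
retry.copiesOnHand -= 1;
const retrySaved = save(retry);
```

Against a real database, the compare-and-save step has to be atomic (for example, a conditional update that matches on both the ID and the version); this sketch only shows the shape of the application logic.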
|
||||||
|
|
||||||
|
## A Final Consideration
|
||||||
|
|
||||||
|
As mentioned above, most document databases are designed with multiple instances in mind. What they do well is apply an update quickly on the local instance, then communicate that change up to the controller. You generally won't find things like sequences or automatically-increasing numeric IDs, because they are difficult to implement in a distributed system without coordinating every insert. If you are using a single instance of a document database, many (but not all!) of the ACID concerns and exceptions go away. If an update requires a "quorum" of servers to report a successful update, but the entire cluster is a single combined controller / worker, using things like transactions or isolation (if supported) will have no appreciable performance effect on your application.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
Understanding what we gain and lose by sticking to ACID, or by deviating from it, can help guide our decision about which kind of data store to target. Once that decision is made, we will be writing an application that uses that data store. While the considerations here focused on the database itself, we'll turn to trade-offs in application development in our next section.
|
||||||
|
|
||||||
|
|
||||||
|
[ACID]: https://en.wikipedia.org/wiki/ACID "ACID • Wikipedia"
|
8
concepts/toc.yml
Normal file
@ -0,0 +1,8 @@
|
|||||||
|
- name: A Brief History of Relational Data
  href: a-brief-history-of-relational-data.md
- name: What Are Documents?
  href: what-are-documents.md
- name: Relational / Document Trade-Offs
  href: relational-document-trade-offs.md
- name: Application Trade-Offs
  href: application-trade-offs.md
|
115
concepts/what-are-documents.md
Normal file
@ -0,0 +1,115 @@
|
|||||||
|
# What Are Documents?
|
||||||
|
|
||||||
|
## Structure Optional
|
||||||
|
|
||||||
|
The majority of the [previous page][prev] was dedicated to describing a conceptual structure of our data, and how that is structured in a high-level language with an ORM library. This is not a bad thing on its own; most data has a defined structure. What happens when that structure changes? Or, what happens when we may not know the structure?
|
||||||
|
|
||||||
|
This is where the document database can provide benefits. We did not show the SQL to create the tables in the library example, but our book type might look something like this in SQLite:
|
||||||
|
|
||||||
|
```sql
CREATE TABLE book (
    id NUMBER NOT NULL PRIMARY KEY,
    title TEXT NOT NULL,
    copies_on_hand INTEGER NOT NULL DEFAULT 0);
```
|
||||||
|
|
||||||
|
If we wanted to add, for example, the date the library obtained the book, we would have to change the structure of the table...
|
||||||
|
|
||||||
|
```sql
ALTER TABLE book ADD COLUMN date_obtained DATE;
```
|
||||||
|
|
||||||
|
Document databases do not require anything like this. For example, creating a `book` collection in MongoDB, using their JavaScript API, is...
|
||||||
|
|
||||||
|
```javascript
db.createCollection('book')
```
|
||||||
|
|
||||||
|
The only structural requirement is that each document have some field that can serve as an identifier for documents in that collection. MongoDB always uses `_id` for this; some other document databases let the identifier field be configured per collection.
|
||||||
|
|
||||||
|
## Mapping the Entities
|
||||||
|
|
||||||
|
In our library, we had books, authors, and patrons as entities. In an equivalent document database setup, we would likely still have separate collections for each. A `book` document might look something like...
|
||||||
|
|
||||||
|
```json
{
  "Id": 342136,
  "Title": "Little Women",
  "CopiesOnHand": 3
}
```
|
||||||
|
|
||||||
|
Because no assumptions are made on structure, if we began adding books with a `DateObtained` field, the database would simply add it, no questions asked.
|
||||||
|
|
||||||
|
```json
{
  "Id": 452343,
  "Title": "The Hunt for Red October",
  "DateObtained": "1986-10-20",
  "CopiesOnHand": 1
}
```
|
||||||
|
|
||||||
|
The only field the database cares about is `Id`, assuming we specified that for our collection's ID.
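Because the database accepts both shapes, it falls to the application to cope with documents that lack the newer field. A minimal sketch, assuming the two example documents above come back from a query as plain objects; the `describe` helper and the "unknown" placeholder are hypothetical choices for illustration.

```javascript
// The two example documents from above, as they might come back from a query.
const books = [
  { Id: 342136, Title: "Little Women", CopiesOnHand: 3 },
  {
    Id: 452343,
    Title: "The Hunt for Red October",
    DateObtained: "1986-10-20",
    CopiesOnHand: 1,
  },
];

// Hypothetical display helper: older documents have no DateObtained at all,
// so the application supplies a fallback instead of assuming the field exists.
function describe(book) {
  const obtained = book.DateObtained ?? "unknown";
  return `${book.Title} (obtained: ${obtained})`;
}

const lines = books.map(describe);
```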
|
||||||
|
|
||||||
|
## Mapping the Relations
|
||||||
|
|
||||||
|
We certainly could bring `book_author` and `book_checked_out` across as documents in their own collections. However, document databases do not (generally) have the concept of foreign keys.
|
||||||
|
|
||||||
|
Let's first tackle the book/author relationship. JSON has an array type, which holds an ordered list of values. We can add an `Authors` property to our `book` document:
|
||||||
|
|
||||||
|
```json
{
  "Id": 342136,
  "Title": "Little Women",
  "Authors": [55923],
  "CopiesOnHand": 3
}
```
|
||||||
|
|
||||||
|
With this structure, if we're rendering search results and want to display the author's name(s) next to the title, we will either need to query the `author` collection for each ID in our `Authors` array, or come up with a projection that crosses two collections. Since we're still storing properties of a `book`, though, we could include the author's name.
|
||||||
|
|
||||||
|
```json
{
  "Id": 342136,
  "Title": "Little Women",
  "Authors": [{
    "Id": 55923,
    "Name": "Alcott, Louisa May"
  }],
  "CopiesOnHand": 3
}
```
|
||||||
|
|
||||||
|
This document does a lot for us; we can now see the title and the authors all together, and the IDs being there would allow us to dig into the data further. If we were writing a Single-Page Application (SPA), this could be used without any transformation at all.
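To make the difference concrete, here is a sketch that builds the same search-result line from each shape of the `book` document. The `authorCollection` map is a hypothetical stand-in for querying the `author` collection, and the output format is an arbitrary choice.

```javascript
// Hypothetical stand-in for the `author` collection, keyed by ID.
const authorCollection = new Map([
  [55923, { Id: 55923, Name: "Alcott, Louisa May" }],
]);

// Shape 1: the book stores only author IDs, so names need a lookup per ID.
const bookWithIds = { Id: 342136, Title: "Little Women", Authors: [55923] };
const namesFromLookup = bookWithIds.Authors
  .map((id) => authorCollection.get(id)?.Name)
  .filter((name) => name !== undefined);

// Shape 2: the book embeds ID and name, so the document is enough on its own.
const bookWithNames = {
  Id: 342136,
  Title: "Little Women",
  Authors: [{ Id: 55923, Name: "Alcott, Louisa May" }],
};
const namesFromDocument = bookWithNames.Authors.map((author) => author.Name);

// Both shapes produce the same line for a search result.
const format = (title, names) => `${title} (${names.join("; ")})`;
const viaLookup = format(bookWithIds.Title, namesFromLookup);
const viaDocument = format(bookWithNames.Title, namesFromDocument);
```

The embedded shape avoids the extra lookup, at the cost of the application needing to know that `Authors` holds objects rather than bare IDs.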
|
||||||
|
|
||||||
|
Conversely, any application code would have to be aware of this structure. Our C# code from the last page would now likely need a `DisplayAuthor` type, and `Authors` would be `ICollection<DisplayAuthor>`. We also see our first instance of repeated data. The next page will be a deeper discussion of the trade-offs we should consider.
|
||||||
|
|
||||||
|
For now, though, we still need to represent the checked-out books. We can use a technique similar to the one we used for authors, and include the return date.
|
||||||
|
|
||||||
|
```json
{
  "Id": 342136,
  "Title": "Little Women",
  "Authors": [{
    "Id": 55923,
    "Name": "Alcott, Louisa May"
  }],
  "CopiesOnHand": 3,
  "CheckedOut": [{
    "Id": 45112,
    "Name": "Anderson, Alice",
    "ReturnDate": "2025-04-02"
  }, {
    "Id": 38472,
    "Name": "Brown, Barry",
    "ReturnDate": "2025-03-27"
  }]
}
```
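With the checkout details embedded, questions about a single book can be answered from the document alone. As a small sketch, here is one way to list the patrons whose return date has passed; the `overdue` helper and the comparison date are assumptions for illustration.

```javascript
// The example document from above, as a plain object.
const book = {
  Id: 342136,
  Title: "Little Women",
  Authors: [{ Id: 55923, Name: "Alcott, Louisa May" }],
  CopiesOnHand: 3,
  CheckedOut: [
    { Id: 45112, Name: "Anderson, Alice", ReturnDate: "2025-04-02" },
    { Id: 38472, Name: "Brown, Barry", ReturnDate: "2025-03-27" },
  ],
};

// Hypothetical helper: which patrons were due back before the given date?
// Dates in YYYY-MM-DD form compare correctly as plain strings.
function overdue(doc, today) {
  return doc.CheckedOut
    .filter((checkout) => checkout.ReturnDate < today)
    .map((checkout) => checkout.Name);
}

const late = overdue(book, "2025-03-30");
```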
|
||||||
|
|
||||||
|
## Structure Reconsidered
|
||||||
|
|
||||||
|
One of the big marketing points for document databases is their ability to handle "unstructured data." I won't go as far as saying that no such thing exists, but the _vast_ majority of data described this way is data whose structure is unknown to the person considering doing something with it. The data itself has structure; that person just does not know what it is when getting started, and knowing the structure is usually a prerequisite for creating the data store. On rare occasions, there may be data sets with several structures mixed together in the same set; even in these data sets, though, the cacophony usually turns out to be a finite set of structures, mixed inconsistently.
|
||||||
|
|
||||||
|
Keep that in mind as we look at some of the trade-offs between document and relational databases. Just as your body needs its skeletal structure against which your muscles and organs can work, your data _has_ structure. Document databases do not abstract that away.
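One way to see the "finite set of structures" in a supposedly unstructured data set is to group documents by the fields they contain. The sketch below is only an illustration: the sample documents and the `structureCounts` helper are hypothetical, and it looks at top-level field names only.

```javascript
// Hypothetical sample of documents that arrived without a declared structure.
const docs = [
  { Id: 1, Title: "Little Women", CopiesOnHand: 3 },
  {
    Id: 2,
    Title: "The Hunt for Red October",
    DateObtained: "1986-10-20",
    CopiesOnHand: 1,
  },
  { Id: 3, Title: "Emma", CopiesOnHand: 2 },
  { Id: 4, Name: "Anderson, Alice" },
];

// Count how many documents share each set of top-level field names.
function structureCounts(documents) {
  const counts = new Map();
  for (const doc of documents) {
    // Sorting makes the signature independent of field order.
    const signature = Object.keys(doc).sort().join(",");
    counts.set(signature, (counts.get(signature) ?? 0) + 1);
  }
  return counts;
}

const shapes = structureCounts(docs);
```

Here four documents collapse into three distinct shapes, which is usually a far smaller number than the document count would suggest.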
|
4
doc-template/public/main.css
Normal file
@ -0,0 +1,4 @@
|
|||||||
|
article h2 {
    border-bottom: solid 1px gray;
    margin-bottom: 1rem;
}
|
37
docfx.json
Normal file
@ -0,0 +1,37 @@
|
|||||||
|
{
  "$schema": "https://raw.githubusercontent.com/dotnet/docfx/main/schemas/docfx.schema.json",
  "build": {
    "content": [
      {
        "files": [
          "**/*.{md,yml}"
        ],
        "exclude": [
          "_site/**"
        ]
      }
    ],
    "resource": [
      {
        "files": [
          "images/**",
          "bitbadger-doc.png"
        ]
      }
    ],
    "output": "_site",
    "template": [
      "default",
      "modern",
      "doc-template"
    ],
    "globalMetadata": {
      "_appName": "Relational Documents",
      "_appTitle": "Relational Documents",
      "_appLogoPath": "bitbadger-doc.png",
      "_appFooter": "Hand-crafted documentation created with <a href=https://dotnet.github.io/docfx target=_blank class=external>docfx</a> by <a href=https://bitbadger.solutions target=_blank class=external>Bit Badger Solutions</a>",
      "_enableSearch": true,
      "pdf": false
    }
  }
}
|
49
index.md
Normal file
@ -0,0 +1,49 @@
|
|||||||
|
---
_layout: landing
---
|
||||||
|
|
||||||
|
# Using Relational Databases as Document Stores
|
||||||
|
|
||||||
|
_(this is a work-in-progress landing page for libraries that allow PostgreSQL and SQLite to be treated as document databases; it will eventually explain the concepts behind this, allowing the documentation for each library to focus more on "how" and less on "why")_
|
||||||
|
|
||||||
|
## Libraries
|
||||||
|
|
||||||
|
These libraries provide a convenient <abbr title="Application Programming Interface">API</abbr> to treat PostgreSQL or SQLite as document stores.
|
||||||
|
|
||||||
|
**BitBadger.Documents** ~ [Documentation][docs-dox] ~ [Git][docs-git]<br>
Use for .NET applications (C#, F#)
|
||||||
|
|
||||||
|
**PDODocument** ~ [Documentation][pdoc-dox] ~ [Git][pdoc-git]<br>
Use for PHP applications (8.2+)
|
||||||
|
|
||||||
|
**solutions.bitbadger.documents** ~ Documentation _(soon)_ ~ Git _(soon)_<br>
Use for <abbr title="Java Virtual Machine">JVM</abbr> applications (Java, Kotlin, Groovy, Scala)
|
||||||
|
|
||||||
|
## Learning
|
||||||
|
|
||||||
|
When we use the term "documents" in the context of databases, we are referring to a database that stores its entries in a structured data format (usually a form of JavaScript Object Notation, or JSON). Unlike relational databases, document databases tend to have a relaxed schema; often, document collections or tables are the only definition required - and some even create those on-the-fly the first time one is accessed!
|
||||||
|
|
||||||
|
_Documents marked as "wip" are works in progress (i.e., not complete). All of these pages should be considered draft quality; if you are reading this, welcome to the early access program!_
|
||||||
|
|
||||||
|
**[A Brief History of Relational Data][hist]**<br>Before we dig in on documents, we'll take a look at some relational database concepts
|
||||||
|
|
||||||
|
**[What Are Documents?][what]**<br>How documents can represent flexible data structures
|
||||||
|
|
||||||
|
**[Relational / Document Trade-Offs][trade]**<br>Considering the practical pros and cons of different data storage paradigms
|
||||||
|
|
||||||
|
**[Application Trade-Offs][app]**<br>Options for applications utilizing relational or document data
|
||||||
|
|
||||||
|
**[Hybrid Data Stores][hybrid]**<br>Combining document and relational data paradigms _(wip)_
|
||||||
|
|
||||||
|
|
||||||
|
[docs-dox]: ./dotnet/ "BitBadger.Documents • Bit Badger Solutions"
[docs-git]: https://git.bitbadger.solutions/bit-badger/BitBadger.Documents "BitBadger.Documents • Bit Badger Solutions Git"
[pdoc-dox]: ./php/ "PDODocument • Bit Badger Solutions"
[pdoc-git]: https://git.bitbadger.solutions/bit-badger/pdo-document "PDODocument • Bit Badger Solutions Git"
[jvm-dox]: ./jvm/ "solutions.bitbadger.documents • Bit Badger Solutions"
[jvm-git]: https://git.bitbadger.solutions/bit-badger/solutions.bitbadger.documents "solutions.bitbadger.documents • Bit Badger Solutions Git"
[hist]: ./concepts/a-brief-history-of-relational-data.md "A Brief History of Relational Data • Bit Badger Solutions"
[what]: ./concepts/what-are-documents.md "What Are Documents? • Bit Badger Solutions"
[trade]: ./concepts/relational-document-trade-offs.md "Relational / Document Trade-Offs • Bit Badger Solutions"
[app]: ./concepts/application-trade-offs.md "Application Trade-Offs • Bit Badger Solutions"
[hybrid]: ./hybrid-data-stores.html "Hybrid Data Stores • Bit Badger Solutions"
|