# Document Design Considerations
When designing any data store, determining how data will be retrieved is often a secondary consideration. Developers get requirements, and we immediately start thinking of how we would store the data that will be produced. Then, when it comes time to search data, produce reports, etc., the process can be painful. We have used the term "consideration" a lot (including in the title of this page!) because there are a lot of ways to store the same information. Understanding how that data will be used (and why, and when) can guide design decisions.
As a quick example, consider a customer record. How many addresses will we store for each one? Should they be labeled? Things like state or province are a finite list of choices; do we enforce an accurate selection at the data level? Do we care about addresses that are no longer current? We could end up with anything from a blob of free-form text up to a set of tables, with pieces of the address spread out among them. How these addresses will be used will likely eliminate some options.
No data storage paradigm eliminates these considerations. It may take a bit more time up front, but schema changes and data migration on an operational system can take even more time (and bring complexity that may have been avoided).
## Recognizing Appropriate Relational Data
This will be a short section, as previous articles should have made the point explicitly: not all data is appropriate for a document model. If certain entities must never be allowed to fall out of sync with the entities they relate to, a document structure is not the best structure for those entities.
## Designing Documents
Having eliminated scenarios where documents are not appropriate, let's design our documents to capture data which fits that paradigm.
### Repeated Data
Many many-to-one relationships in a relational database could be represented as an array of changes in the parent document. Returning to our hotel room example, the rental history of each room could be represented as an array on the room itself. This would give us a quick way to find who was in each room at what point; and, provided our database keys lined up, we could also tell a customer which rooms we charged to their account, and for which dates.
The main question for this structure is this: what other queries against a room would we require? And, given how we could best answer these questions, is an array of reservations the best way to represent that? This is a key consideration for an array-in-document vs. separate multi-entry table decision. Adding a reservation, in an inlined array, is relatively trivial. However, which entity owns the reservation array? Are reservations based on the room, while related to the customer? Are they based on the customer, and associated to the room? Are they an entity unto themselves (with each reservation stored as its own row rather than inlined in a document)?
In this case, this author would likely have reservations as their own entity, or have reservations inlined in the customer document. It may make sense to split reservations and completed stays into separate arrays; queries for upcoming reservations would likely occur more frequently than those for completed stays, and this would narrow the data set for the former queries to only those reservations that are actually pending.
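To make the split-array idea concrete, here is a minimal Python sketch. The property names (`reservations`, `completedStays`, `arrival`, `nights`, `roomType`) and both helper functions are hypothetical, chosen only for illustration; they are not part of any library discussed in this series.

```python
from datetime import date

# Hypothetical customer document: pending reservations and completed stays
# live in separate arrays, so "what is upcoming?" never scans old stays.
customer = {
    "id": "cust-1",
    "name": "Pat Example",
    "reservations": [
        {"arrival": "2024-06-01", "nights": 2, "roomType": "king"},
        {"arrival": "2024-07-15", "nights": 3, "roomType": "double-queen"},
    ],
    "completedStays": [],
}


def complete_stay(doc: dict, arrival: str) -> dict:
    """Return a copy of doc with the reservation for `arrival` moved to completedStays."""
    moved = [r for r in doc["reservations"] if r["arrival"] == arrival]
    remaining = [r for r in doc["reservations"] if r["arrival"] != arrival]
    return {**doc, "reservations": remaining, "completedStays": doc["completedStays"] + moved}


def upcoming(doc: dict, today: date) -> list:
    """Pending reservations arriving on or after `today`."""
    return [r for r in doc["reservations"] if date.fromisoformat(r["arrival"]) >= today]


updated = complete_stay(customer, "2024-06-01")
```

The trade-off is that completing a stay becomes a small document update (moving one array entry) in exchange for cheaper reads of pending reservations.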
_(This is not "the right answer"; it is but one way it could be implemented.)_
One area that is more straightforward would be e-mail addresses for our customers. If we want to allow them to have more than one e-mail address on their record, this is easily represented as an inline array in the customer document. While it does mean that we cannot look up a customer by e-mail address using a straight `=` condition, we _can_ store their primary e-mail address as the first entry in this array, and use `email[0]` in cases where we need it.
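As a sketch of both kinds of lookup, the following Python uses the standard-library `sqlite3` module. It assumes a SQLite build that includes the JSON functions (`json_extract`, `json_each`); the table layout, sample documents, and function names are made up for this example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (data TEXT NOT NULL)")
docs = [
    {"id": "c1", "name": "Pat", "email": ["pat@example.com", "pat.work@example.com"]},
    {"id": "c2", "name": "Sam", "email": ["sam@example.com"]},
]
conn.executemany("INSERT INTO customer (data) VALUES (?)", [(json.dumps(d),) for d in docs])


def primary_email(customer_id: str):
    """The first array entry is treated as the primary address."""
    row = conn.execute(
        "SELECT json_extract(data, '$.email[0]') FROM customer "
        "WHERE json_extract(data, '$.id') = ?",
        (customer_id,),
    ).fetchone()
    return row[0] if row else None


def find_by_any_email(address: str) -> list:
    """A plain `=` on the array will not match; expand it with json_each instead."""
    rows = conn.execute(
        "SELECT DISTINCT customer.data "
        "FROM customer, json_each(customer.data, '$.email') "
        "WHERE json_each.value = ?",
        (address,),
    ).fetchall()
    return [json.loads(r[0]) for r in rows]
```

PostgreSQL offers comparable tools for `jsonb` columns, though the exact syntax differs from what is shown here.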
### Related Data
One theme, underlying all this discussion, is that data is related to other data. These relations are where our next decision point lies. Are these relationships optional? If so, can these optional relationships be defined by their presence? If so, the relationship may be a candidate for a document property instead of a child-table relationship (with or without a foreign key).
Let's think through this reservation scenario a bit more. Most hotel reservations are not made for a specific room; they are usually based on room type (number and configuration of beds, extra space, etc.). The hotel knows how many rooms they have of which type, and what reservations they currently have, so they can give accurate availability numbers. However, they do not usually assign a room number when the reservation is made. This gives them the flexibility to accommodate changes with current customers - say, someone who stays over for an additional 3 days - without being disruptive to either their current customer _or_ the next customer they had assigned to the room that is now occupied.
We may, though, have a few regular customers who stay frequently, and they want a particular room. Since these are our "regulars," we do not want to create a system where we cannot assign a room at reservation time. (Inconveniencing your regular customers is not a recipe for success in any business!)
If we make a reservation its own document, we could have the following properties:
- ID
- Customer ID
- Arrival Date
- Duration of Stay (nights)
- Room Type
- Room ID
- Do Not Move (`true` or `false`, if present)
- Special Instructions
Of these, the first five are required; the first identifies the reservation, the second identifies the customer, and the next three are the heart of the reservation. For most reservations, these would be the only fields in the document (or the others would be `null`). Once rooms are assigned, the room ID would be filled in. However, for our regulars, we would fill it in when they made the reservation, and we would set the "do not move" flag to indicate that this room assignment should not be changed. Special instructions could be anything ("first floor", "near stairs", etc.).
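One way this document shape might look in application code is sketched below in Python. The class, its field names, and the `to_document` helper are hypothetical; the point is only that optional properties can be left out of the stored JSON entirely until they are needed.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Reservation:
    # The five required properties
    id: str
    customer_id: str
    arrival_date: str
    nights: int
    room_type: str
    # Optional properties; None means "not present in the document"
    room_id: Optional[str] = None
    do_not_move: Optional[bool] = None
    special_instructions: Optional[str] = None

    def to_document(self) -> str:
        """Serialize to JSON, leaving out optional properties that were never set."""
        return json.dumps({k: v for k, v in asdict(self).items() if v is not None})


# A typical reservation: no room yet, so only the required properties are stored.
typical = Reservation("r-100", "c-1", "2024-08-01", 2, "king")

# A regular's reservation: room assigned up front and pinned in place.
regular = Reservation(
    "r-101", "c-2", "2024-08-03", 4, "suite",
    room_id="412", do_not_move=True, special_instructions="near stairs",
)
```

Storing `null` for the unset properties instead of omitting them would work just as well; either way, the presence of a room ID is what signals that a room has been assigned.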
> [!NOTE]
> Although Customer ID is a required field, a document database does not enforce this constraint. Managing these sorts of relationships becomes the responsibility of the application. If reservations were instead stored as an array in the customer document, we would not need the Customer ID property; a reservation's presence in that customer's document would establish the relationship.
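
As an illustration of what "the responsibility of the application" can mean in practice, the following Python sketch checks the customer reference before saving a reservation. It assumes a SQLite build with the JSON functions; the tables, property names, and `save_reservation` function are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (data TEXT NOT NULL)")
conn.execute("CREATE TABLE reservation (data TEXT NOT NULL)")
conn.execute(
    "INSERT INTO customer (data) VALUES (?)",
    (json.dumps({"id": "c-1", "name": "Pat"}),),
)


def save_reservation(doc: dict) -> None:
    """Insert a reservation only if its customerId names an existing customer document."""
    customer_id = doc.get("customerId")
    if not customer_id:
        raise ValueError("customerId is required")
    found = conn.execute(
        "SELECT 1 FROM customer WHERE json_extract(data, '$.id') = ?",
        (customer_id,),
    ).fetchone()
    if found is None:
        raise ValueError(f"no customer with id {customer_id}")
    conn.execute("INSERT INTO reservation (data) VALUES (?)", (json.dumps(doc),))


save_reservation(
    {"id": "r-1", "customerId": "c-1", "arrival": "2024-08-01", "nights": 2, "roomType": "king"}
)
```

Unlike a foreign key, this check only runs when the application calls it; deleting the customer later would leave the reservation pointing at nothing unless the application handles that as well.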
We can apply this same optional relationship pattern to other documents. Customer service tickets could have an optional Room ID property, which would indicate if a call pertained to a specific room. These tickets could also have an array of log entries with date, user, and a narrative about what happened. This gives us another example of both optional IDs and relationship via containment.
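A small Python sketch of the ticket example follows, again assuming a SQLite build with the JSON functions. The property names (`roomId`, `log`, `narrative`) and the two query functions are illustrative only.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticket (data TEXT NOT NULL)")
tickets = [
    # Optional roomId is present: this ticket pertains to a specific room.
    {"id": "t-1", "roomId": "412", "log": [
        {"date": "2024-05-01", "user": "alex", "narrative": "Guest reported a leaking faucet."},
        {"date": "2024-05-02", "user": "jo", "narrative": "Maintenance replaced the washer."},
    ]},
    # No roomId at all: a general inquiry.
    {"id": "t-2", "log": [
        {"date": "2024-05-03", "user": "alex", "narrative": "Caller asked about parking."},
    ]},
]
conn.executemany("INSERT INTO ticket (data) VALUES (?)", [(json.dumps(t),) for t in tickets])


def tickets_for_room(room_id: str) -> list:
    """Tickets whose optional roomId matches; the log entries come along by containment."""
    rows = conn.execute(
        "SELECT data FROM ticket WHERE json_extract(data, '$.roomId') = ?", (room_id,)
    ).fetchall()
    return [json.loads(r[0]) for r in rows]


def general_tickets() -> list:
    """Tickets with no roomId; json_extract yields NULL when the property is absent."""
    rows = conn.execute(
        "SELECT data FROM ticket WHERE json_extract(data, '$.roomId') IS NULL"
    ).fetchall()
    return [json.loads(r[0]) for r in rows]
```

Note that retrieving a ticket also retrieves its full log; no join or second query is involved.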
### Domain Objects
Some readers may be thinking "Man, I'm never going to be dealing with data at this level; I just want to store my application's data". In this case, the application's structure takes the lead, and the database is there to support it. (Microsoft's Entity Framework "Code First" pioneered this concept for relational data stores.) When we say "domain object," we mean whatever the application uses to structure its data; it could be a class or a dictionary / associative array.
Storing and retrieving domain objects involves JSON serialization and deserialization. The domain object is serialized to JSON to store it, and deserialized from JSON to reconstitute it in the application. JSON only has six data types - array, object, string, number, boolean, and null - yet it can represent arbitrary structures using just these types.
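The round trip can be sketched in a few lines of Python using only the standard library. The `Customer` and `Address` classes and the two conversion functions are hypothetical stand-ins for whatever your application uses as its domain objects.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Address:
    street: str
    city: str


@dataclass
class Customer:
    id: str
    name: str
    address: Address
    email: List[str] = field(default_factory=list)


def to_json(customer: Customer) -> str:
    """Serialize: nested dataclasses become JSON objects, lists become JSON arrays."""
    return json.dumps(asdict(customer))


def from_json(text: str) -> Customer:
    """Deserialize: rebuild the nested Address before constructing the Customer."""
    raw = json.loads(text)
    return Customer(
        id=raw["id"],
        name=raw["name"],
        address=Address(**raw["address"]),
        email=raw.get("email", []),
    )


original = Customer("c-1", "Pat", Address("1 Main St", "Chicago"), ["pat@example.com"])
restored = from_json(to_json(original))
```

Here the nested `Address` becomes a JSON object and the e-mail list becomes a JSON array, so the whole customer travels as a single document.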
In these cases, the document's structure will match that of the domain object. Instead of the way an object-relational mapper splits out other objects, arrays, etc., all the information for that domain item is in one document. This means that data access paths match those in your application. `customer.address.city` in your application can be addressed by the JSON path `$.address.city` on the customer document. Assuming the document is stored in the `data` column of a `customer` table, querying the city could be done as follows in both PostgreSQL and SQLite (version 3.38.0 or later, which added the `->` and `->>` operators):
```sql
SELECT data FROM customer WHERE data->'address'->>'city' = :city
```
> [!NOTE]
> The document libraries hosted here provide the dot-notation access for use in programs; to find all customers in Chicago, the following C# code will generate something that looks a lot like the query above.
>
> ```csharp
> Find.ByFields<Customer>("customer", Field.Equal("address.city", "Chicago"));
> ```
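
The city lookup can also be tried end-to-end with nothing but Python's standard library. This sketch uses `json_extract` with the `$.address.city` path rather than the `->` / `->>` operators, so that it also runs on older SQLite builds (it still assumes the JSON functions are available); the sample data and the `customers_in` function are invented for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (data TEXT NOT NULL)")
customers = [
    {"id": "c-1", "name": "Pat", "address": {"street": "1 Main St", "city": "Chicago"}},
    {"id": "c-2", "name": "Sam", "address": {"street": "9 Elm Ave", "city": "Denver"}},
]
conn.executemany(
    "INSERT INTO customer (data) VALUES (?)", [(json.dumps(c),) for c in customers]
)


def customers_in(city: str) -> list:
    """Find customer documents by a nested property, using a named :city parameter."""
    rows = conn.execute(
        "SELECT data FROM customer WHERE json_extract(data, '$.address.city') = :city",
        {"city": city},
    ).fetchall()
    return [json.loads(r[0]) for r in rows]
```

The JSON path in the query mirrors the property path in the application, which is the correspondence this section describes.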
## Conclusion
> If you have read this entire series and arrived here - **THANK YOU**! People like you are the ones this author had in mind when he made the decision to write it.
The main points to take away are:
- Document databases are an interesting and compelling way of structuring data.
- Common relational databases have implemented JSON document columns and functions/operators to manipulate them.
- Using a hybrid approach allows us to avoid some relational pain points (i.e., complexity).
- Documents are not a magic bullet; they still require design considerations.
Documents may not be _the_ solution for your data storage needs - or, they may! - but they are a valuable tool in your collection. JSON document columns in an otherwise-relational table are another interesting option which we did not explore here. There are many ways to incorporate the good parts of documents to reduce complexity, and you are probably already using a database which supports them.
The libraries linked across the top of the page provide an easy, document-database style interface for storing documents in PostgreSQL and SQLite. They also provide a custom mapping function interface against database results (`Npgsql.FSharp` for F#, `ADO.NET` for C#, `PDO` for PHP, and `JDBC` for JVM languages). Instead of creating a connection; creating a command; setting up the query; iteratively binding parameters; executing the query; and looping through the results, these take a query, a collection of parameters, and a mapping function - all that other work is done, but the libraries abstract that away.
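To show the general shape of a "query, parameters, mapping function" interface, here is a minimal Python sketch over the standard `sqlite3` module. This is not the API of the libraries mentioned above; `custom_list` and everything around it are hypothetical, and the query assumes a SQLite build with the JSON functions.

```python
import json
import sqlite3
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def custom_list(
    conn: sqlite3.Connection,
    query: str,
    params: Sequence,
    mapper: Callable[[sqlite3.Row], T],
) -> List[T]:
    """Run the query, bind the parameters, and map every row; the caller never sees a cursor."""
    cursor = conn.execute(query, params)
    try:
        return [mapper(row) for row in cursor]
    finally:
        cursor.close()


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets the mapper read columns by name
conn.execute("CREATE TABLE customer (data TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO customer (data) VALUES (?)",
    [
        (json.dumps({"name": "Pat", "address": {"city": "Chicago"}}),),
        (json.dumps({"name": "Sam", "address": {"city": "Denver"}}),),
    ],
)

names = custom_list(
    conn,
    "SELECT data FROM customer WHERE json_extract(data, '$.address.city') = ?",
    ("Chicago",),
    lambda row: json.loads(row["data"])["name"],
)
```

The cursor handling, parameter binding, and result loop still happen; they are simply written once inside the helper instead of at every call site.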
Whether these libraries find their way into your tool belt or not, we hope you have gained knowledge. When we reduce complexity - leading to applications which are more robust, reliable, and maintainable - everybody wins!