Breaking The Relational Model
Complete Developer Podcast - A podcast by BJ Burns and Will Gant - Thursdays
The rule: normalize until it hurts; denormalize until it works. That’s what you typically get taught early on. While normalizing a database is a solid practice and something you should do when designing a system, it can jump up and bite you once you get under load. You probably won’t deal with most of these issues early in a system’s life (and shouldn’t), but the time will come when some degree of denormalization is necessary to keep performance up to snuff. If you start suggesting denormalizing a system, expect to have to show that it improves performance significantly; there are good reasons that systems aren’t denormalized, and people will typically balk at doing so. However, there are circumstances where it’s worth considering. After all, CPU cycles on a database server are often some of the most expensive ones out there – go check your SQL Server or Oracle licensing fees for proof. Rules are made to be followed in general, but sometimes you have to bend them just to get by. This is especially true in computer science, where many “rules” are more like “guidelines for most cases” at best.

Episode Breakdown

When additional join tables will hurt performance.
Joins are not cheap or free, and complex joins (for instance, one involving a function call) are especially bad. This matters most when you are joining across a lot of tables and aggregating a lot of results on a regular basis for data that doesn’t change frequently. (A rough SQL sketch follows the breakdown.)

When you need to keep historical database records.
Consider a system that manages invoices and products: you don’t want a product price change today to mess up historical data on invoices from last year. Historical tables are typically used mostly for reads, so you want to cut down the number of joins just to make that faster. They can also often be indexed more heavily than the transactional tables used in online transaction processing, which have to stay fast for writes. (Sketch below.)

When data needs to be structured for reporting purposes.
Reporting systems generally contain a lot of data duplication simply so that more interesting trends can be spotted with the data at hand. Reporting tables also have very different indexing requirements and access patterns: they tend to be read heavy and need different levels of locking depending on what you’re doing. (Sketch below.)

When the data itself is hierarchical in structure.
Hierarchical data is difficult to represent efficiently in a relational model. This is especially true when you need to rearrange a tree or determine whether one node is an ancestor of another, particularly if the structure has an arbitrary depth. It’s also very difficult to write logic for tree structures stored in a relational model such that you don’t accidentally create cycles in the data. (Sketch below.)

When the data is a graph structure.
It’s even more fun when the data is a graph, because cycles are actually possible. Storing nodes and edges is easy enough, but things like finding the shortest path are difficult, and proving that any two nodes are connected carries a serious risk of major performance problems. (Sketch below.)

When you have very large records.
Relational databases have issues with large records, whether you query them or not. They aren’t really designed for holding things like images and huge chunks of text.
This is especially true if you want to index and search through that content, as the search capabilities of a relational database aren’t really built for this purpose. If you are storing large records, it’s worth asking whether they belong in the relational database at all. (Sketch below.)
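For the join-and-aggregate case, here is a minimal sketch of what denormalization can look like, assuming a PostgreSQL-style dialect and invented table and column names (orders, order_lines, order_total): cache the computed total on the parent row so reads no longer pay for the join.

```sql
-- Hypothetical schema: orders and order_lines are assumed for illustration.
-- Fully normalized read: join + aggregate on every query.
SELECT o.order_id, SUM(l.quantity * l.unit_price) AS order_total
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id
GROUP BY o.order_id;

-- Denormalized alternative: store the total on the order row and
-- refresh it when lines change, trading write cost for cheap reads.
ALTER TABLE orders ADD COLUMN order_total NUMERIC(12,2);

UPDATE orders o
SET order_total = (
  SELECT COALESCE(SUM(l.quantity * l.unit_price), 0)
  FROM order_lines l
  WHERE l.order_id = o.order_id
);

-- Reads no longer need the join at all.
SELECT order_id, order_total FROM orders;
```

The trade-off is the usual one: writes now have to keep the cached value in sync, which is only worth it when the data changes far less often than it is read.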
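For the historical-records case, the common pattern is to copy the values you need to preserve onto the historical row at the time of the transaction instead of joining back to the live product table. The table and column names below are invented for the sketch.

```sql
-- Hypothetical invoice_lines table: the price and name are copied from
-- products at the moment of sale, so later changes can't rewrite history.
CREATE TABLE invoice_lines (
  invoice_id    INT           NOT NULL,
  line_number   INT           NOT NULL,
  product_id    INT           NOT NULL,  -- still linked for reference
  product_name  TEXT          NOT NULL,  -- denormalized snapshot
  unit_price    NUMERIC(12,2) NOT NULL,  -- denormalized snapshot
  quantity      INT           NOT NULL,
  PRIMARY KEY (invoice_id, line_number)
);

-- Inserting a line copies the current product data into the snapshot columns.
INSERT INTO invoice_lines (invoice_id, line_number, product_id,
                           product_name, unit_price, quantity)
SELECT 1001, 1, p.product_id, p.name, p.price, 3
FROM products p
WHERE p.product_id = 42;
```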
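For reporting, one common shape (sketched here with placeholder names, not taken from the episode) is a wide, denormalized table that is rebuilt or appended on a schedule and indexed for how the reports actually filter, rather than how the OLTP system writes.

```sql
-- Hypothetical wide reporting table, refreshed periodically (e.g. nightly).
CREATE TABLE sales_report_daily (
  report_date   DATE          NOT NULL,
  region_name   TEXT          NOT NULL,
  product_name  TEXT          NOT NULL,
  units_sold    INT           NOT NULL,
  revenue       NUMERIC(14,2) NOT NULL
);

-- The periodic load flattens several joined tables into one read-optimized shape.
INSERT INTO sales_report_daily (report_date, region_name, product_name, units_sold, revenue)
SELECT o.order_date, r.name, p.name,
       SUM(l.quantity), SUM(l.quantity * l.unit_price)
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id
JOIN products p    ON p.product_id = l.product_id
JOIN regions r     ON r.region_id = o.region_id
WHERE o.order_date = CURRENT_DATE - 1
GROUP BY o.order_date, r.name, p.name;

-- Indexes match how the reports filter, not how the transactional system writes.
CREATE INDEX ix_sales_report_daily_date_region
  ON sales_report_daily (report_date, region_name);
```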
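For hierarchical data, the adjacency-list approach (a parent_id column) is easy to store but needs recursive queries for questions like "is X an ancestor of Y". A PostgreSQL-style sketch with invented names; the depth guard is a crude protection against accidental cycles in the data.

```sql
-- Hypothetical categories table stored as an adjacency list.
CREATE TABLE categories (
  category_id INT PRIMARY KEY,
  parent_id   INT REFERENCES categories (category_id),
  name        TEXT NOT NULL
);

-- Walk upward from category 57 to collect all of its ancestors.
WITH RECURSIVE ancestors AS (
  SELECT category_id, parent_id, name, 1 AS depth
  FROM categories
  WHERE category_id = 57
  UNION ALL
  SELECT c.category_id, c.parent_id, c.name, a.depth + 1
  FROM categories c
  JOIN ancestors a ON c.category_id = a.parent_id
  WHERE a.depth < 100          -- crude guard against accidental cycles
)
SELECT category_id, name
FROM ancestors
WHERE category_id <> 57;
```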
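For graph data, storing nodes and edges is simple, but answering "are these two nodes connected?" means a traversal. In plain SQL that is another recursive query, and on a large or cyclic graph it can get expensive quickly, which is part of why dedicated graph databases exist. Names here are made up for the sketch.

```sql
-- Hypothetical directed edge list.
CREATE TABLE edges (
  from_node INT NOT NULL,
  to_node   INT NOT NULL,
  PRIMARY KEY (from_node, to_node)
);

-- Is node 9 reachable from node 1?  On a cyclic graph the depth guard
-- below is what keeps the traversal from running away.
WITH RECURSIVE reachable AS (
  SELECT to_node, 1 AS depth
  FROM edges
  WHERE from_node = 1
  UNION ALL
  SELECT e.to_node, r.depth + 1
  FROM edges e
  JOIN reachable r ON e.from_node = r.to_node
  WHERE r.depth < 50           -- stop runaway traversal on cycles
)
SELECT EXISTS (SELECT 1 FROM reachable WHERE to_node = 9) AS connected;
```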
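For very large records, one workaround (an illustration rather than anything prescribed in the episode) is to keep the heavy payload out of the hot tables, in a separate table or outside the database entirely, and store only a reference. Table and column names are invented.

```sql
-- Hypothetical split: document metadata stays relational and queryable,
-- while the heavy payload lives elsewhere (separate table, file store, etc.).
CREATE TABLE documents (
  document_id  INT PRIMARY KEY,
  title        TEXT NOT NULL,
  uploaded_at  TIMESTAMP NOT NULL,
  storage_url  TEXT NOT NULL          -- pointer to the actual large content
);

-- Listing and searching metadata never touches the large payload.
SELECT document_id, title
FROM documents
WHERE uploaded_at > CURRENT_DATE - 30;
```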