Over roughly 35 years of relational database history, the question of proper design has always been a contentious one. Picking the most contentious database design topics to write about is no difficult feat. Normalization. It sounds like a cold, dry concept, and in fact…it is. Yet, at its core, normalization is simply about making sure a relational database design makes sense, represents what the designer expected, and follows a pattern of implementation that matches the needs of the RDBMS engine. When we normalize a database, the process works to eliminate redundant and overcomplicated data in multiple ways, making sure your rows and columns have a single meaning.

In this short article, we can’t, for obvious reasons, formally define the normal forms, but we can define a quick overview of the goals:

  • Column values should represent only one value, and should not need to be broken down by the user for use
  • Rows are unique from one another (artificial, surrogate keys don't count)
  • Rows represent the same number of items (for example, no arrays in a column or set of columns)
  • Columns are either a key to identify a row, or they are pieces of information describing what the entire key identifies
  • A row represents just one thing, which is evidenced by proper relationships between key columns

The true issue has been that normalization was not often treated as a practical task, but as an academic one. It requires some academic-style thinking, but really, normalization is to database design what modularization is to procedural programming; a core skill that is a necessity for everyone. A database that is not normalized would have one table with one column. Anything more of either and you are somewhat normalized.

As an example, consider modeling a classroom of students at the PASS Summit. In the classroom we will have a few possible entities that we might use to model this classroom:

  • Chair
  • Student

If you consider for a moment, the concept of a chair in a PASS Summit data model, there are a few ways we could consider tackling this design:

  1. As a table, where a row represents a chair in the room. We would store basic attributes of the chair, like type of seating, location in the room, etc.
  2. As a table, with a row representing a chair, with advanced details, we would likely store the model number of the chair, the color of the chair, etc. The model number of the chair is what is referred to as a smart key, giving the company quick access to the style, color, year of manufacture, parts lists etc. about the chair. Everything in the model number should be expanded into individual columns. Every chair shall be identified by a serial number and asset tag.
  3. Not as a table at all, but rather a column containing a simple count of seating locations as an attribute of the room. We may also want to break down types of seating, like regular and accessible.

Intuitively, you probably already believe you know the right answer, or at least, the most likely very wrong answer, but like a NULL column value, the correct answer is unknown. The only requirement stated is: we want to model a classroom, and we know from experience that there are chairs in a room. Possibility #2, where we have broken down every detail of a chair into lots of detailed columns, is often mistaken for a truly well normalized solution. Rather it is an overengineered solution for the requirements given. First, because we don't know what the customer wants, and second, because having a value like model number that is in the manufacturer's system as a smart key doesn't mean we have to care about those details in our database, unless the customer needs those details for some reason.

In the end, normalization and the customer's desires are interlocking concepts. A normalized database has exactly the right number of tables and columns to describe what the customer wants in a way that is easiest to maintain their data. And that will always be relevant.