Schema Design, Normalization, and Denormalization
Learning Objectives
- You know the basic principles of schema design and normalization.
- You understand the trade-offs and considerations involved in denormalization.
- You are familiar with common denormalization patterns used to optimize database performance.
- You know how to evaluate the impact of denormalization on database performance.
We’ll first briefly discuss schema design principles, after which we go into normalization and denormalization. For a more in-depth discussion on schema design and normalization, look for an introductory databases course.
Schema Design Principles
Schema design is the starting point of building scalable and maintainable databases. The initial step involves forming an understanding of data requirements, which includes identifying business objectives, defining use cases, and gathering data needs from stakeholders.
Although schema design is typically discussed in introductory databases courses, it is also a critical skill for web developers. This holds even when using schema-less databases like MongoDB, as schema design activities help in understanding the data.
As an example, we might have a scenario where a system keeps track of courses, course exercises, students, and exercise submissions. A one-table solution might have the following columns:
- course_name
- exercise_name
- student_name
- submission_data
- submission_created_at
- submission_passed
As we know from introductory databases courses, such a design does not follow normalization principles, which aim to reduce redundancy and improve data integrity. A better solution could be to use four tables to represent the entities: Course, Exercise, Student, and CourseExerciseSubmission. The CourseExerciseSubmission table would have foreign keys to the Course, Exercise, and Student tables, as well as columns for data, created_at, and passed.
As an Entity-Relationship (ER) diagram, such a schema could be represented as shown in Fig. 1 below.
Entity-Relationship (ER) modeling is a key part of schema design. ER modeling defines entities, their attributes, and the relationships between them, providing a clear blueprint for the database structure. ER diagrams offer a visual representation that makes it easier to comprehend the data structures and their connections.
The above schema still has problems, such as multiple students potentially having the same name and strings being used as primary keys. To resolve these issues, we could introduce integer primary keys for the Course, Exercise, Student, and CourseExerciseSubmission tables, and refer to them using foreign keys in the CourseExerciseSubmission table. The updated schema is shown in Fig. 2.
The foreign keys ensure that the relationships between the entities are maintained and that referential integrity is preserved. As an example, a COURSE_EXERCISE_SUBMISSION record cannot exist without a corresponding COURSE, EXERCISE, and STUDENT record, which is enforced by the foreign key constraints. In the one-table solution, making typos when adding a submission could have led to orphaned records.
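As a minimal sketch, the schema in Fig. 2 could be expressed in SQL roughly as follows; the table and column names below are assumptions, as the figure defines the authoritative ones.
-- A sketch of the normalized schema; names are assumptions.
CREATE TABLE courses (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE exercises (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE students (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL
);

-- Foreign keys enforce that a submission always refers to an existing
-- course, exercise, and student.
CREATE TABLE course_exercise_submissions (
  id SERIAL PRIMARY KEY,
  course_id INTEGER NOT NULL REFERENCES courses (id),
  exercise_id INTEGER NOT NULL REFERENCES exercises (id),
  student_id INTEGER NOT NULL REFERENCES students (id),
  data TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  passed BOOLEAN NOT NULL DEFAULT FALSE
);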
Normalization
The transition from the single-table database to the multi-table database is an example of normalization, which is also discussed in introductory databases courses. Normalization is a systematic methodology for organizing data within a database to minimize redundancy and eliminate undesirable anomalies — it is often practiced through a series of stages known as normal forms, each addressing specific types of data anomalies:
- First Normal Form (1NF): Ensures that the data is atomic, meaning each column contains indivisible values, and eliminates repeating groups or arrays within tables.
- Second Normal Form (2NF): Builds on 1NF by ensuring that all non-key attributes are fully functionally dependent on the entire primary key, addressing partial dependencies.
- Third Normal Form (3NF): Further refines the structure by eliminating transitive dependencies, ensuring that non-key attributes depend solely on the primary key and not on other non-key attributes (a small sketch follows this list).
- Boyce-Codd Normal Form (BCNF): Addresses more complex dependency scenarios, ensuring that every determinant is a candidate key, thereby resolving anomalies that 3NF might not cover.
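To make the transitive dependency mentioned under 3NF concrete, consider a hypothetical submissions table that also stores the student's email. The email depends on student_id, a non-key attribute, rather than on the submission's own key, so moving it into its own table restores 3NF. The table and column names below are purely illustrative.
-- Hypothetical: student_email depends on student_id, not on the
-- submission id, which violates 3NF (a transitive dependency).
CREATE TABLE submissions_with_email (
  id SERIAL PRIMARY KEY,
  student_id INTEGER NOT NULL,
  student_email TEXT NOT NULL,
  data TEXT NOT NULL
);

-- In 3NF, the email lives in the table of the entity it describes.
CREATE TABLE students_with_email (
  id SERIAL PRIMARY KEY,
  email TEXT NOT NULL
);

CREATE TABLE submissions_in_3nf (
  id SERIAL PRIMARY KEY,
  student_id INTEGER NOT NULL REFERENCES students_with_email (id),
  data TEXT NOT NULL
);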
The advantages of normalization include enhanced data integrity, efficient storage utilization, and simplified maintenance. Normalized schemas make operations such as updates, inserts, and deletions less error-prone, as there is minimal data redundancy and fewer opportunities for inconsistencies.
Need for Denormalization
Normalization also comes with challenges: a highly normalized schema can lead to increased query complexity, as retrieving related data often requires table joins.
As an example, if we wished to show a list of all submissions for a specific student in the above case, we would need to join the COURSE_EXERCISE_SUBMISSION table with the COURSE, EXERCISE, and STUDENT tables. Now, imagine a situation where there are tens of millions of submissions, hundreds of thousands of students, and thousands of exercises. Merely joining the tables could be computationally expensive, at least when compared to a single table with no joins.
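A sketch of such a query, reusing the table and column names assumed in the earlier schema sketch (the filter value is made up for illustration):
-- All submissions for a specific student; requires three joins.
SELECT courses.name AS course,
       exercises.name AS exercise,
       course_exercise_submissions.created_at,
       course_exercise_submissions.passed
FROM course_exercise_submissions
JOIN courses ON courses.id = course_exercise_submissions.course_id
JOIN exercises ON exercises.id = course_exercise_submissions.exercise_id
JOIN students ON students.id = course_exercise_submissions.student_id
WHERE students.name = 'Ada Lovelace';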
Multiple table joins can result in performance issues, particularly in read-heavy applications where frequent access to combined data is essential.
Denormalization involves intentionally introducing redundancy into a database schema to simplify queries and enhance read performance. By reducing the number of joins required in queries, denormalization can speed up retrieval, which is crucial for high-traffic applications. It is used, for example, in:
- E-commerce Product Catalogs: Rapid retrieval of product information is essential for a seamless shopping experience.
- Social Media Feeds: Quick access to user-generated content enhances user engagement.
- Analytics Dashboards: Fast aggregation and reporting are critical for timely insights.
In such cases, the performance gains from denormalization often outweigh the challenges associated with maintaining redundant data.
Trade-Offs and Considerations
Denormalization involves trade-offs. While it can enhance read performance, it may compromise data integrity and increase storage overhead. Additionally, write operations can become more complex, as updates must be propagated across multiple redundant entries to maintain consistency.
Determining when to denormalize involves identifying specific conditions in which redundancy can help improve the application’s performance. Key considerations include:
- Query Patterns: Analyzing the types of queries that are frequently executed can reveal opportunities for denormalization. If certain queries involve multiple joins or complex aggregations, denormalization may be beneficial.
- Data Access Patterns: Understanding how data is accessed and modified can guide denormalization decisions. If read operations significantly outnumber write operations, denormalization can optimize read performance.
- Maintenance Overhead: Denormalized schemas can be more complex to manage and maintain, as updates and deletions must be carefully coordinated across redundant entries.
Evaluating the costs and benefits of denormalization is required before implementing changes. Here, again, data from actual usage is key: informed decisions cannot be made without understanding how the database is actually used.
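For PostgreSQL, one way to ground such decisions in actual usage data is the pg_stat_statements extension, which records how often each query runs and how much time it takes. A minimal sketch, assuming the extension is available on the server:
-- Requires pg_stat_statements to be listed in shared_preload_libraries.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Most frequently executed queries and their cumulative execution time
-- (column names as in PostgreSQL 13 and newer).
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;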
Common Denormalization Patterns
Denormalization optimizes database performance by intentionally introducing redundancy. The most common patterns include:
- Data Duplication: Replicating frequently accessed data across multiple tables to eliminate costly joins. As an example, an e-commerce platform might store the customer’s name and address directly in the orders table to quickly display order details without joining the customers table.
- Aggregate Tables: Creating summary tables with precomputed totals or metrics for fast retrieval. As an example, a sales dashboard may use an aggregate table that stores monthly sales totals per region, enabling quick generation of reports without recalculating from transaction data each time.
- Materialized Views: Storing the results of complex queries to speed up access. As an example, a financial application might use a materialized view to maintain up-to-date portfolio valuations, avoiding the need to compute them from numerous transactions on every request (a small sketch follows this list).
- Embedding Hierarchical Data: Storing nested or related data within a single table to simplify access. As an example, a blog system could embed tags in a JSON field of each blog entry, allowing both the post and the tags to be retrieved without additional joins.
- Hybrid Models: Combining normalized and denormalized structures to leverage the benefits of both. As an example, a system may use a normalized schema for inventory transactions to ensure data integrity, while maintaining denormalized tables for daily sales reports to enhance performance.
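As a sketch of the materialized view pattern in PostgreSQL, a precomputed summary (here, hypothetical monthly sales totals per region) can be stored and refreshed periodically; the sales table and its columns are assumptions for illustration.
-- Hypothetical sales(region, amount, sold_at) table; the view stores
-- precomputed monthly totals so reports avoid rescanning all transactions.
CREATE MATERIALIZED VIEW monthly_sales_per_region AS
SELECT region,
       DATE_TRUNC('month', sold_at) AS month,
       SUM(amount) AS total_sales
FROM sales
GROUP BY region, DATE_TRUNC('month', sold_at);

-- The view is a snapshot and must be refreshed to reflect new sales.
REFRESH MATERIALIZED VIEW monthly_sales_per_region;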
Effect of Denormalization
To concretely illustrate the effect of denormalization, let’s continue with the users and items example. Imagine a situation where the application often needs information on how many users each item has. In a normalized schema, this would require joining the items and users_to_items tables.
SELECT items.name, COUNT(*) AS user_count
FROM items
JOIN users_to_items
ON items.id = users_to_items.item_id
GROUP BY items.id
LIMIT 10;
When we study the performance of the above query, we see that there is room for improvement: there’s quite a bit of work going on to achieve the result.
EXPLAIN ANALYZE SELECT items.name, COUNT(*) AS user_count
FROM items
JOIN users_to_items
ON items.id = users_to_items.item_id
GROUP BY items.id
LIMIT 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8488.44..8489.74 rows=10 width=20) (actual time=61.012..63.317 rows=10 loops=1)
  -> Finalize GroupAggregate (cost=8488.44..8618.44 rows=1000 width=20) (actual time=61.010..63.313 rows=10 loops=1)
       Group Key: items.id
       -> Gather Merge (cost=8488.44..8603.44 rows=1000 width=20) (actual time=61.002..63.303 rows=21 loops=1)
            Workers Planned: 1
            Workers Launched: 1
            -> Sort (cost=7488.43..7490.93 rows=1000 width=20) (actual time=57.454..57.469 rows=506 loops=2)
                 Sort Key: items.id
                 Sort Method: quicksort Memory: 64kB
                 Worker 0: Sort Method: quicksort Memory: 64kB
                 -> Partial HashAggregate (cost=7428.60..7438.60 rows=1000 width=20) (actual time=57.208..57.278 rows=1000 loops=2)
                      Group Key: items.id
                      Batches: 1 Memory Usage: 129kB
                      Worker 0: Batches: 1 Memory Usage: 129kB
                      -> Hash Join (cost=28.50..5958.01 rows=294118 width=12) (actual time=0.521..36.000 rows=250000 loops=2)
                           Hash Cond: (users_to_items.item_id = items.id)
                           -> Parallel Seq Scan on users_to_items (cost=0.00..5154.18 rows=294118 width=4) (actual time=0.006..8.893 rows=250000 loops=2)
                           -> Hash (cost=16.00..16.00 rows=1000 width=12) (actual time=0.470..0.471 rows=1000 loops=2)
                                Buckets: 1024 Batches: 1 Memory Usage: 55kB
                                -> Seq Scan on items (cost=0.00..16.00 rows=1000 width=12) (actual time=0.025..0.194 rows=1000 loops=2)
Planning Time: 0.363 ms
Execution Time: 63.425 ms
The query takes approximately 0.4ms to plan and 63ms to execute. If the query were executed frequently, especially with more data, it could become a bottleneck. Let’s see what would happen if the items table already contained information on how many users each item has.
Create a migration file V7__denormalize_user_counts_to_items.sql with the following content:
-- Add a denormalized counter column to items.
ALTER TABLE items
ADD COLUMN user_count INT DEFAULT 0;

-- Backfill the counter from the existing users_to_items rows.
WITH item_user_counts AS (
SELECT item_id, COUNT(*) AS user_count
FROM users_to_items
GROUP BY item_id
)
UPDATE items
SET user_count = item_user_counts.user_count
FROM item_user_counts
WHERE items.id = item_user_counts.item_id;
After running the migration, the query to get the user counts for items becomes much simpler:
SELECT items.name, user_count FROM items LIMIT 10;
It is also significantly faster to execute. Below, the query takes approximately 0.05ms to execute, which is a significant improvement over the previous query’s approximately 63ms execution time.
EXPLAIN ANALYZE SELECT items.name, user_count FROM items LIMIT 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.22 rows=10 width=12) (actual time=0.019..0.023 rows=10 loops=1)
  -> Seq Scan on items (cost=0.00..22.00 rows=1000 width=12) (actual time=0.018..0.020 rows=10 loops=1)
Planning Time: 0.247 ms
Execution Time: 0.047 ms
As a downside, whenever an item is assigned to a user, the user_count field must be updated. This can be done in the same transaction that assigns the item to the user, but it still adds complexity to the application.
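As a sketch, assuming the users_to_items table has user_id and item_id columns, the assignment and the counter update could be wrapped in a single transaction as follows (the ids are made up for illustration):
-- Assign an item to a user and keep the denormalized counter in sync.
BEGIN;

INSERT INTO users_to_items (user_id, item_id)
VALUES (42, 7);

UPDATE items
SET user_count = user_count + 1
WHERE id = 7;

COMMIT;
Alternatively, a trigger on users_to_items could update the counter automatically on inserts and deletes, which moves the bookkeeping from the application into the database at the cost of some added schema complexity.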