Indexing and Query Optimization
Learning Objectives
- You can create indexes in PostgreSQL.
- You know of basic query optimization techniques.
- You understand the effect of indexing on query performance.
Indexing
Indexing is a basic database optimization technique for improving query performance. Indexing involves creating separate search structures in the database that make searching for data faster. The most common index type is the B-tree index, a balanced tree structure that allows for fast lookups, insertions, and deletions.
There is a range of index types in PostgreSQL, documented in the PostgreSQL documentation. The most common index types are B-tree, Hash, GIN, BRIN, and GiST. Each index type has its own strengths and weaknesses; B-trees work well for the majority of cases.
Indexes are created using the CREATE INDEX command, which specifies the table and column(s) to index. Indexes can be created on single or multiple columns, and can be used to enforce unique constraints or primary keys.
Let’s consider the following todos table (the same as in the first database migration file of the walking skeleton):
CREATE TABLE todos (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
done BOOLEAN NOT NULL DEFAULT false
);
As an example, to add an index on the name column, we can use the following command:
CREATE INDEX idx_todos_name ON todos(name);
Similarly, if we wished to add an index on the done column, we could use the following command:
CREATE INDEX idx_todos_done ON todos(done);
Indexes can also be created on expressions, which can be useful for indexing computed values. For example, if we had a database table called users with an email column, we could create an index on the email column that ensures that email addresses are unique in a case-insensitive manner:
CREATE UNIQUE INDEX idx_users_email_unique ON users(LOWER(email));
In the above example, the LOWER(email) expression converts the email address to lowercase before indexing it.
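Note that an expression index is used only when a query filters on the same expression. As an illustration, a case-insensitive lookup that can use the above index would be written as follows (the email address is a made-up example):
EXPLAIN SELECT * FROM users WHERE LOWER(email) = LOWER('Alice@example.com');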
Indexes can also be partial, indexing only a subset of the rows in a table. Partial indexes can be useful for optimizing queries that only access a subset of the data in a table. For example, if we have a table called users with a status column that can have the values active or inactive, we could create a partial index on the email column that only indexes rows where the status column is active:
CREATE INDEX idx_users_email_active ON users(email) WHERE status = 'active';
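A partial index can only help queries whose conditions match the index predicate. For example, the following query could use the above index, whereas a lookup among inactive users could not (the email address is again a made-up example):
SELECT * FROM users WHERE status = 'active' AND email = 'alice@example.com';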
It is also possible to create indexes that cover all columns needed by specific queries, enabling data retrieval directly from the index. As an example, a library system might create an index on the isbn, title, and author columns so that search queries can return book information without accessing the main books table.
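As a sketch, such a covering index could be created in PostgreSQL (version 11 or newer) with the INCLUDE clause, which stores extra columns in the index without making them part of the search key. The books table and its columns are hypothetical here:
CREATE INDEX idx_books_isbn ON books(isbn) INCLUDE (title, author);
With this index in place, a query that looks up a book by isbn and selects only the title and author could be answered with an index-only scan.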
Query Optimization
Indexing is a key aspect of query optimization, aimed at enhancing database query performance. Effective optimization leads to faster response times, reduced resource usage, and better scalability.
In PostgreSQL, the query planner/optimizer automatically optimizes queries by evaluating various execution plans and selecting the most efficient one based on factors like table size, index availability, and data distribution.
EXPLAIN and EXPLAIN ANALYZE
Query plans can be viewed using the EXPLAIN and EXPLAIN ANALYZE commands: EXPLAIN displays the execution plan for a query without running it, while EXPLAIN ANALYZE executes the query and provides the actual execution plan along with real runtime statistics.
For example, the output of the EXPLAIN command for a query might look like the plan shown below.
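As a concrete illustration, assuming a users table with an index idx_users_name on the name column (we create a similar table and index later in this section), such a plan could be produced by the following command:
EXPLAIN SELECT * FROM users WHERE name = 'User 999983';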
QUERY PLAN
---------------------------------------------------------------------------
Index Scan using idx_users_name on users (cost=0.42..8.44 rows=1 width=15)
Index Cond: (name = 'User 999983'::text)
(2 rows)
The values can be interpreted as follows:
- Index Scan: The query uses an index to retrieve the data.
- idx_users_name: The name of the index used.
- cost=0.42..8.44: The estimated cost of the query execution (start-up and total cost).
- rows=1: The estimated number of rows returned by the query.
- width=15: The estimated average width of the rows returned by the query, in bytes.
While the cost is an arbitrary number, it can be used to compare the relative efficiency of different query plans. Lower costs generally indicate more efficient query plans. The EXPLAIN ANALYZE command provides additional information, such as the actual number of rows processed and the time taken to execute the query.
Common Query Optimization Techniques
The most common query optimization techniques are as follows:
- Retrieve only the needed data by specifying the columns you need in the SELECT statement instead of using SELECT * (see the illustration after this list).
- Use indexes to speed up data retrieval by creating indexes on columns that are frequently used in queries.
- Filter early by applying filters (i.e., WHERE conditions) as early as possible in the query to reduce the number of rows processed.
- Optimize joins by ensuring that join conditions are efficient and that the tables being joined are properly indexed.
- Denormalize data by storing redundant data to avoid expensive joins, especially in read-heavy applications.
- Use query caching to store the results of frequently executed queries and avoid re-executing them.
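As a small illustration of the first point using the todos table from above: when only the names of unfinished todos are needed, the first query below is preferable to the second.
SELECT name FROM todos WHERE done = false;
SELECT * FROM todos WHERE done = false;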
Effect of Indexing on Query Performance
Let’s look into the effect of indexing on query performance. We’ll start by creating a table called users and populating it with a hundred thousand users. In your walking skeleton, create a migration file V3__users.sql with the following contents.
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL
);
INSERT INTO users (name) SELECT 'User ' || (n)::text FROM generate_series(1, 100000) n;
The above migration file creates a table users with an id column as the primary key and a name column. It then inserts a hundred thousand users into the table, where each user has a unique name.
Then, run the migration to create the table and insert the data by restarting the application.
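If you wish, you can verify that the data is in place; the following query should return 100000:
SELECT COUNT(*) FROM users;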
Query without index
Now that we have data in the database, let’s look into two queries. In the first query, we look for a user using an identifier.
EXPLAIN SELECT * FROM users WHERE id = 42;
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using users_pkey on users (cost=0.29..8.31 rows=1 width=14)
Index Cond: (id = 42)
The above query uses an index to find the user with the identifier 42. The query plan shows that the query uses the users_pkey index, the primary key index on the id column, to retrieve the data.
PostgreSQL automatically creates an index on the primary key column(s) of a table, which is why the query uses an index.
Now, let’s look into another query that looks for a user using the name.
EXPLAIN SELECT * FROM users WHERE name = 'User 42';
QUERY PLAN
---------------------------------------------------------
Seq Scan on users (cost=0.00..1791.00 rows=1 width=14)
Filter: (name = 'User 42'::text)
The second query performs a sequential scan of the data to find the user with the name User 42.
A sequential scan reads all the rows in a table to find the matching rows, which can be slow for large tables. In contrast, an index scan uses an index to quickly locate the matching rows, which is much faster.
While the above examples look at the query plan, we can also use EXPLAIN ANALYZE to concretely see the time it takes to execute the query. First, for the query that looks for the user with the identifier 42, the query takes approximately 0.2 ms to execute and approximately 0.1 ms to plan.
EXPLAIN ANALYZE SELECT * FROM users WHERE id = 42;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Index Scan using users_pkey on users (cost=0.29..8.31 rows=1 width=14) (actual time=0.179..0.182 rows=1 loops=1)
Index Cond: (id = 42)
Planning Time: 0.111 ms
Execution Time: 0.213 ms
At the same time, the query that looks for the user with the name User 42 takes approximately 12 ms to execute and approximately 0.1 ms to plan.
EXPLAIN ANALYZE SELECT * FROM users WHERE name = 'User 42';
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on users (cost=0.00..1791.00 rows=1 width=14) (actual time=0.041..12.121 rows=1 loops=1)
Filter: (name = 'User 42'::text)
Rows Removed by Filter: 99999
Planning Time: 0.094 ms
Execution Time: 12.182 ms
Total query time and Prepared Statements
The time that a query takes consists of the planning time and the execution time. The planning time is the time it takes to create the query plan, while the execution time is the time it takes to execute the query.
Prepared statements can be used to reduce the planning time, as the query plan is created only once and can be reused for multiple executions. When working with PostgreSQL, the PREPARE keyword is used to create a prepared statement, and the EXECUTE keyword is used to execute it.
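As a minimal sketch of the syntax, a lookup by identifier on the users table could be prepared and reused as follows (find_user is just an example name):
PREPARE find_user (integer) AS
SELECT * FROM users WHERE id = $1;

EXECUTE find_user(42);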
The Postgres.js library that we have used automatically creates prepared statements for queries.
Adding an index on name
Let’s next add an index to the users table that indexes the users based on the name. Create a database migration file called V4__users_name_index.sql with the following contents.
CREATE INDEX idx_users_name ON users(name);
And then, restart the application (or run the migrations).
Once we have created the index, the query that looks for a user with a specific name uses the index.
EXPLAIN SELECT * FROM users WHERE name = 'User 42';
QUERY PLAN
-----------------------------------------------------------------------------
Index Scan using idx_users_name on users (cost=0.42..8.44 rows=1 width=14)
Index Cond: (name = 'User 42'::text)
When looking concretely at the performance, the query that looks for the user with the name User 42 is now also significantly faster to execute. Instead of the earlier 12.1 ms, the execution time is approximately 0.2 ms.
EXPLAIN ANALYZE SELECT * FROM users WHERE name = 'User 42';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Scan using idx_users_name on users (cost=0.42..8.44 rows=1 width=14) (actual time=0.124..0.127 rows=1 loops=1)
Index Cond: (name = 'User 42'::text)
Planning Time: 0.119 ms
Execution Time: 0.154 ms
In the above example, we have 100,000 users. If there were ten times as many users, the sequential scan would take approximately ten times longer (although with plenty of data, PostgreSQL starts to parallelize the scans). The index scan, however, would still be fast, as the lookup time of a B-tree index grows only logarithmically with the number of rows in the table.
Extending to many-to-many relationships
Let’s continue the example by adding a table items and a many-to-many relationship between the tables users and items. There will be a thousand items in the database, and in addition, each user will have five items linked to them through a table users_to_items. Create a migration file V5__items_and_users_to_items.sql with the following contents.
CREATE TABLE items (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE users_to_items (
user_id INTEGER NOT NULL REFERENCES users(id),
item_id INTEGER NOT NULL REFERENCES items(id)
);
INSERT INTO items (name) SELECT 'Item ' || (n)::text FROM generate_series(1, 1000) n;
-- For each user, put all items in a random order and number them;
-- rows with rn <= 5 then give five random items per user.
WITH random_items AS (
SELECT
u.id AS user_id,
i.id AS item_id,
ROW_NUMBER() OVER (PARTITION BY u.id ORDER BY RANDOM()) AS rn
FROM
users u
CROSS JOIN
items i
)
INSERT INTO users_to_items (user_id, item_id)
SELECT
user_id,
item_id
FROM
random_items
WHERE
rn <= 5;
The migration will take quite a while to run, so you might want to grab a coffee while waiting. If it feels like it is taking too long, you may also reduce the number of items and users.
As the auto_explain extension is enabled, you’ll also see the slow-running query in the server logs.
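If auto_explain is not enabled in your own setup, it can also be loaded for a single session. As a sketch, the following logs the plans of all queries that take longer than 100 ms (the threshold is just an example value):
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '100ms';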
Finding items for a specific user
Finding all items that belong to a specific user is done with a join that first links the items table to the users_to_items table, and then further links that table to the users table (or the other way around). As an example, finding all items linked to a user with a specific name would be written as follows; below, we use User 42 as the name.
SELECT items.name FROM items
JOIN users_to_items ON users_to_items.item_id = items.id
JOIN users ON users.id = users_to_items.user_id
WHERE users.name = 'User 42';
When we look into how the query is executed, we observe room for improvement.
EXPLAIN ANALYZE SELECT items.name FROM items
JOIN users_to_items ON users_to_items.item_id = items.id
JOIN users ON users.id = users_to_items.user_id
WHERE users.name = 'User 42';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1008.72..6936.09 rows=5 width=8) (actual time=0.426..30.207 rows=5 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Nested Loop (cost=8.72..5935.59 rows=3 width=8) (actual time=11.030..24.874 rows=2 loops=2)
-> Hash Join (cost=8.45..5934.72 rows=3 width=4) (actual time=11.025..24.862 rows=2 loops=2)
Hash Cond: (users_to_items.user_id = users.id)
-> Parallel Seq Scan on users_to_items (cost=0.00..5154.18 rows=294118 width=8) (actual time=0.008..11.375 rows=250000 loops=2)
-> Hash (cost=8.44..8.44 rows=1 width=4) (actual time=0.050..0.051 rows=1 loops=2)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Scan using idx_users_name on users (cost=0.42..8.44 rows=1 width=4) (actual time=0.043..0.045 rows=1 loops=2)
Index Cond: (name = 'User 42'::text)
-> Index Scan using items_pkey on items (cost=0.28..0.29 rows=1 width=12) (actual time=0.003..0.003 rows=1 loops=5)
Index Cond: (id = users_to_items.item_id)
Planning Time: 0.393 ms
Execution Time: 30.286 ms
One of the first observations is that the users_to_items table is scanned in parallel to find the matching rows (Parallel Seq Scan on users_to_items). Although one might assume that a table with columns that reference other tables would automatically be indexed, this is not the case: PostgreSQL does not automatically create indexes on foreign key columns. We can look at the table definition using the \d command in psql.
\d users_to_items
Table "public.users_to_items"
Column | Type | Collation | Nullable | Default
---------+---------+-----------+----------+---------
user_id | integer | | not null |
item_id | integer | | not null |
Foreign-key constraints:
"users_to_items_item_id_fkey" FOREIGN KEY (item_id) REFERENCES items(id)
"users_to_items_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id)
There is no mention of indexes in the table definition, so of course, no index can be used.
Adding an index to the users_to_items table
Let’s add an index to the table users_to_items, where the index is defined on both columns (i.e., it is a multi-column index). Create a migration file V6__users_to_items_index.sql with the following contents.
CREATE INDEX idx_users_to_items ON users_to_items (user_id, item_id);
Then, run the migration to create the index. Note that column order matters in a multi-column index: this index supports lookups by user_id alone or by both columns, but not efficient lookups by item_id alone. Now, when we look at the table in the database, we see that the index is in place.
\d users_to_items
Table "public.users_to_items"
Column | Type | Collation | Nullable | Default
---------+---------+-----------+----------+---------
user_id | integer | | not null |
item_id | integer | | not null |
Indexes:
"idx_users_to_items" btree (user_id, item_id)
Foreign-key constraints:
"users_to_items_item_id_fkey" FOREIGN KEY (item_id) REFERENCES items(id)
"users_to_items_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id)
Now, with the index in place, the query is faster.
EXPLAIN ANALYZE SELECT items.name FROM items
JOIN users_to_items ON users_to_items.item_id = items.id
JOIN users ON users.id = users_to_items.user_id
WHERE users.name = 'User 42';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=1.11..14.46 rows=5 width=8) (actual time=0.086..0.144 rows=5 loops=1)
-> Nested Loop (cost=0.84..13.00 rows=5 width=4) (actual time=0.080..0.084 rows=5 loops=1)
-> Index Scan using idx_users_name on users (cost=0.42..8.44 rows=1 width=4) (actual time=0.063..0.064 rows=1 loops=1)
Index Cond: (name = 'User 42'::text)
-> Index Only Scan using idx_users_to_items on users_to_items (cost=0.42..4.51 rows=5 width=8) (actual time=0.012..0.014 rows=5 loops=1)
Index Cond: (user_id = users.id)
Heap Fetches: 0
-> Index Scan using items_pkey on items (cost=0.28..0.29 rows=1 width=12) (actual time=0.011..0.011 rows=1 loops=5)
Index Cond: (id = users_to_items.item_id)
Planning Time: 1.094 ms
Execution Time: 0.181 ms
The execution time drops from approximately 30 ms to 0.2 ms, a significant improvement. At the same time, we also notice that the planning time increases; this is due to the query planner needing to consider more options when creating the query plan.
Above, we’ve used EXPLAIN and EXPLAIN ANALYZE directly in the database, doing one-shot testing. In reality, experiments that look into changes in the database should be done in a more controlled manner, where the database is benchmarked under a set of given conditions to understand the performance implications of changes. For this, there exist tools like pgbench.
It is also possible to use pg_stat_statements to study the performance of queries over time. Later in the course, when we look into monitoring, we’ll also construct a dashboard that visualizes database and query performance.
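As a sketch, assuming pg_stat_statements has been enabled (it must be listed in shared_preload_libraries and created with CREATE EXTENSION pg_stat_statements), the most time-consuming queries could be listed as follows (column names as of PostgreSQL 13):
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;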