Database indexes
Learning objectives
- You know of the importance of indexes for improving database query performance.
- You know how to study the performance of database queries.
- You know that database management systems such as PostgreSQL provide tools for studying query performance.
When we consider a simple database, there are a handful of approaches that can be taken to improve performance. These include (1) caching database query results to avoid re-reading data that hasn't changed, (2) designing and applying meaningful database indexes to allow faster searching of information, (3) data denormalization to avoid expensive joins, and (4) table partitioning. Here, we'll briefly look into the effect of indexing on query performance.
Let's start by looking into a simple situation with a table called users.
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL
);
To start with a meaningful amount of data, let's add a million users to the database.
INSERT INTO users (name) SELECT 'User ' || (n)::text FROM generate_series(1, 1000000) n;
When we query the database, we see the following.
SELECT COUNT(*) FROM users;
count
---------
1000000
(1 row)
SELECT * FROM users LIMIT 2;
id | name
----+--------
1 | User 1
2 | User 2
(2 rows)
EXPLAIN
The EXPLAIN command can be used to see the query plan of a database query. It highlights whether tables are read using a sequential scan and whether any indexes are at play.
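EXPLAIN on its own only shows estimates. The related EXPLAIN ANALYZE command actually executes the query and reports measured row counts and timings alongside the estimates -- the exact numbers naturally depend on the machine.
EXPLAIN ANALYZE SELECT * FROM users WHERE id = 1;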
Now that we have data in the database, let's look into two queries. In the first query, we look for a user using an identifier, while in the second query, we look for a user using a name.
EXPLAIN SELECT * FROM users WHERE id = 999983;
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using users_pkey on users (cost=0.42..8.44 rows=1 width=15)
Index Cond: (id = 999983)
(2 rows)
EXPLAIN SELECT * FROM users WHERE name = 'User 999983';
QUERY PLAN
-------------------------------------------------------------------------
Gather (cost=1000.00..11614.43 rows=1 width=15)
Workers Planned: 2
-> Parallel Seq Scan on users (cost=0.00..10614.33 rows=1 width=15)
Filter: (name = 'User 999983'::text)
(4 rows)
As we can see, the first query uses an index, while the second query performs a sequential scan of the data (although in parallel). The query plans also outline the cost of each query, which is an estimate produced by the PostgreSQL query planner. The cost is expressed in arbitrary units -- the first number is the estimated start-up cost, while the second number is the estimated total cost of returning all the rows.
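The first query can use an index even though we have not created any: the PRIMARY KEY constraint on the id column makes PostgreSQL automatically create a unique index, here called users_pkey. As a quick check, the indexes of a table can be listed from the pg_indexes view.
-- list the indexes on the users table; users_pkey backs the primary key
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'users';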
The cost of the first query is considerably smaller than the cost of the second query. Let's add an index on the name column of the users table.
CREATE INDEX idx_users_name ON users(name);
Index types
There are a handful of index types that we can choose from. By default, PostgreSQL uses the B-Tree index.
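Other index types include hash, GiST, GIN, and BRIN. As a sketch (not part of the running example), the index type can be chosen with the USING clause -- a hash index, for instance, supports only equality comparisons. The index name below is our own.
-- a hash index on name, usable only for equality comparisons
CREATE INDEX idx_users_name_hash ON users USING HASH (name);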
Once we have created the index, the query that looks for a user with a specific name is considerably faster, as it uses the idx_users_name index.
EXPLAIN SELECT * FROM users WHERE name = 'User 999983';
QUERY PLAN
-----------------------------------------------------------------------------
Index Scan using idx_users_name on users (cost=0.42..8.44 rows=1 width=15)
Index Cond: (name = 'User 999983'::text)
(2 rows)
Let's continue the example by adding a table items, and by adding a many-to-many relationship between the tables users and items. The relationship is stored in a table users_to_items, which has foreign keys to the users and items tables.
CREATE TABLE items (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE users_to_items (
user_id INTEGER NOT NULL REFERENCES users(id),
item_id INTEGER NOT NULL REFERENCES items(id)
);
Let's further add items to the database table items. The query below mirrors the one we used to populate the users table, adding a million rows.
INSERT INTO items (name) SELECT 'Item ' || (n)::text FROM generate_series(1, 1000000) n;
Now that we have both users and items, let's add a few items to each user.
INSERT INTO users_to_items (user_id, item_id)
SELECT u.id, i.id
FROM (
SELECT id FROM users
) u
CROSS JOIN (
SELECT id FROM items
ORDER BY RANDOM()
LIMIT 5
) i;
In the above query, we select all the identifiers of the users to create a temporary table u, and select five randomly picked identifiers from the items table to create a temporary table i. We then perform a cartesian join on the tables u and i, and add the resulting data to the users_to_items table. In effect, this leads to adding five million rows to the table users_to_items.
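As a sanity check, we can count the rows in the table -- with a million users and five items per user, we expect five million rows.
SELECT COUNT(*) FROM users_to_items;
-- expected count: 5000000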
Finding all the items that belong to a specific user would be done with a join that first links the items table to the users_to_items table, and then further links that table to the users table (or the other way around). As an example, finding all items linked to a user with a specific name would be written as follows -- below, we use User 999983 as the name.
SELECT items.name FROM items
JOIN users_to_items ON users_to_items.item_id = items.id
JOIN users ON users.id = users_to_items.user_id
WHERE users.name = 'User 999983';
When we look into how the query is conducted, we observe potential room for improvement.
EXPLAIN SELECT items.name FROM items
JOIN users_to_items ON users_to_items.item_id = items.id
JOIN users ON users.id = users_to_items.user_id
WHERE users.name = 'User 999983';
QUERY PLAN
----------------------------------------------------------------------------------------------------
Gather (cost=1008.88..49435.96 rows=5 width=11)
Workers Planned: 2
-> Nested Loop (cost=8.88..48435.46 rows=2 width=11)
-> Hash Join (cost=8.46..48434.56 rows=2 width=4)
Hash Cond: (users_to_items.user_id = users.id)
-> Parallel Seq Scan on users_to_items (cost=0.00..42957.33 rows=2083333 width=8)
-> Hash (cost=8.44..8.44 rows=1 width=4)
-> Index Scan using idx_users_name on users (cost=0.42..8.44 rows=1 width=4)
Index Cond: (name = 'User 999983'::text)
-> Index Scan using items_pkey on items (cost=0.42..0.45 rows=1 width=15)
Index Cond: (id = users_to_items.item_id)
(11 rows)
One of the first observations is that the users_to_items table is sequentially scanned to find matching rows. Although one might assume that columns referencing other tables would automatically be indexed, this is not the case in PostgreSQL: foreign keys do not create indexes on the referencing columns. Let's add an index to the users_to_items table, defined on both columns (i.e., a multi-column index).
CREATE INDEX idx_users_to_items ON users_to_items (user_id, item_id);
Now, with the index in place, the query is faster.
EXPLAIN SELECT items.name FROM items
JOIN users_to_items ON users_to_items.item_id = items.id
JOIN users ON users.id = users_to_items.user_id
WHERE users.name = 'User 999983';
QUERY PLAN
----------------------------------------------------------------------------------------------------
Nested Loop (cost=1.28..15.26 rows=5 width=11)
-> Nested Loop (cost=0.86..13.01 rows=5 width=4)
-> Index Scan using idx_users_name on users (cost=0.42..8.44 rows=1 width=4)
Index Cond: (name = 'User 999983'::text)
-> Index Only Scan using idx_users_to_items on users_to_items (cost=0.43..4.52 rows=5 width=8)
Index Cond: (user_id = users.id)
-> Index Scan using items_pkey on items (cost=0.42..0.45 rows=1 width=15)
Index Cond: (id = users_to_items.item_id)
(8 rows)
Although the costs are expressed in arbitrary units, the estimated cost of the query with the index is 15.26, while the estimated cost of the query without the index is 48435.46. In practice, adding the index improves the performance of the query by reducing its estimated cost to approximately 0.03% of the original.
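Note that the column order of a multi-column B-tree index matters: the index on (user_id, item_id) serves lookups by user_id, as in our join, but a lookup by item_id alone could not use it efficiently. If such lookups were common, a separate index could be considered -- as a sketch, with an index name of our own choosing:
-- a separate index for looking up rows by item_id alone
CREATE INDEX idx_users_to_items_item_id ON users_to_items (item_id);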
On indexing queries
Note that although adding indexes is often meaningful, it is also possible to degrade the performance of the database by adding them, as every index must be maintained on writes. In practice, when deciding what data to index, one starts by analyzing frequently executed slow queries. Here, external tools like PgHero and PGWATCH are highly useful.
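The statistics can also reveal indexes that are never used and merely add write overhead. As a sketch, the pg_stat_user_indexes view records how many times each index has been scanned.
-- indexes that have never been used for scans
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;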
PostgreSQL also has a handful of tables and views that collect cumulative query statistics. For example, the view pg_stat_user_tables contains information on the number of sequential scans over tables (among other things), which can be used for determining what to optimize. Similarly, the pg_stat_statements view provides information on all SQL statements executed on the server. The pg_stat_user_tables view is available by default, while pg_stat_statements needs to be loaded before it can be used.
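As a sketch of using these -- the configuration step requires a server restart, and column names such as total_exec_time vary somewhat across PostgreSQL versions:
-- tables with the most sequential scans (available by default)
SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 5;

-- pg_stat_statements must first be preloaded; in postgresql.conf:
--   shared_preload_libraries = 'pg_stat_statements'
-- after a restart, the extension is created in the database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- the statements that have consumed the most execution time
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;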