Database Setup, Studying Queries, and Connection Pools
Learning Objectives
- You practice setting up a PostgreSQL database and running migrations.
- You know tools for studying slow queries and database connections.
- You understand the concept of connection pools and their benefits.
Adding a PostgreSQL database
Building on the walking skeleton with Deno and Astro from the end of the last part, follow the chapter PostgreSQL database and migrations from the Web Software Development course and add a PostgreSQL database with database migrations to the skeleton.
Once ready, you should have an application with a database that responds to requests. In particular, a query to the path localhost:8000/todos should return the todos from the database, or an empty array if there are no todos in the database.
curl localhost:8000/todos
[]%
curl localhost:8000
{"message":"Hello world!"}%
Studying database queries
PostgreSQL comes with ready-made functionality for studying database queries. The auto_explain module logs the execution plans of slow queries, while the pg_stat_statements extension records execution statistics of SQL statements.
Both of these add a small overhead, so they should be used with care, typically only in development or when investigating performance problems in production.
Enabling auto_explain and pg_stat_statements
To take them into use, modify the project's compose.yaml to add a command that preloads the libraries.
database:
  container_name: postgresql_database
  image: postgres:17.0
  restart: unless-stopped
  env_file:
    - project.env
  command: >
    postgres
    -c shared_preload_libraries=auto_explain,pg_stat_statements
    -c auto_explain.log_min_duration=10
Now, when you restart the application, PostgreSQL will log the execution plans of statements that take more than 10 milliseconds to run. Preloading pg_stat_statements also makes it possible to collect statement statistics in the database.
The logs are available in the Docker logs of the database container. When running the application with docker compose up, the logs are shown directly in the terminal; otherwise, you can view them with the following command, which uses the container name from compose.yaml.
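docker logs postgresql_database
A logged auto_explain entry has roughly the following shape; this sample is illustrative, as the exact query text and plan depend on your data:
LOG:  duration: 12.34 ms  plan:
  Query Text: SELECT * FROM todos
  Seq Scan on todos  (cost=0.00..22.70 rows=1270 width=36)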
Enabling statement statistics
The statement statistics need to be enabled for each database that you want to monitor. To enable them, create a new migration file, say V2__enable_pg_stat_statements.sql, and add the following content to it:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
When you again restart the application and this migration file is run, a view called pg_stat_statements can be queried in the database. It has numerous columns with query statistics, including the query itself, the number of times the query has been run, the total time spent running the query, and the average time the query has taken.
For additional details on the columns, see the documentation of pg_stat_statements.
As an example, if we wish to see the ten queries with the highest average execution time among those that have been run more than five times, we can run the following query:
SELECT query, mean_exec_time
FROM pg_stat_statements
WHERE calls > 5
ORDER BY mean_exec_time DESC
LIMIT 10;
The column mean_exec_time shows the average query time in milliseconds; under minimal load, most, if not all, of the queries shown in the result should be sub-millisecond.
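The view also records, among other columns, calls and total_exec_time. Sorting by total time helps spot queries that are individually fast but run so often that they dominate the overall load; for example:
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;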
When you initially look at the query results, you may notice that most rows are not relevant to the application. This is partly because the database has not been queried heavily yet, and partly because the default user is also used for many other tasks. To better see queries from your application, visit localhost:8000/todos a handful of times and run the above query again. You should now see the query that selects the todos from the database.
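If you want to start measuring from a clean slate, the collected statistics can be discarded with the pg_stat_statements_reset function; by default, only superusers (or roles specifically granted the privilege) may call it:
SELECT pg_stat_statements_reset();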
To limit the listed queries to a specific user, you can filter on the userid column, which corresponds to the usesysid column of the pg_user view. As an example, the following would limit the queries to those made by the user username.
SELECT query, mean_exec_time
FROM pg_stat_statements
WHERE calls > 5
  AND userid = (
    SELECT usesysid
    FROM pg_user
    WHERE usename = 'username'
  )
ORDER BY mean_exec_time DESC
LIMIT 10;
Note that as username is the default superuser created when the database was set up in the walking skeleton, it is also used to create databases, run migrations, and so on. In a production setting, it's best practice to create a separate, non-superuser role specifically for your application.
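As a sketch, such a role could be created along the following lines; the role name, password, and the exact privileges to grant are placeholders that depend on your setup:
-- Hypothetical application role; the name and password are placeholders.
CREATE ROLE todo_app WITH LOGIN PASSWORD 'change-me';
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO todo_app;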
Studying connections
We can also look into database connections by examining the view pg_stat_activity. It contains information about the current connections to the database, including the query being run, the user who is running it, and how long it has been running.
For example, the following query shows the current connections to the database, including the connection ID, the name of the user, the time when the connection was initiated, the time when the last query was run, the state of the connection, and the last query executed on that connection:
SELECT pid, usename, backend_start, query_start, state, query
FROM pg_stat_activity
WHERE state IS NOT NULL;
When you run the above query, you see at least one connection: the one that you are using to run the query. The state column shows the state of the connection: active means that the connection is currently running a query, while idle means that it is not.
Now, open up the address localhost:8000/todos in your browser or via curl, and run the above query again. You should now see an additional connection, the one from the server, which will be in the idle state once its query completes.
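For a quick overview of how many connections are in each state, you can also aggregate over the view:
SELECT state, COUNT(*) AS connections
FROM pg_stat_activity
GROUP BY state;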
Connection pools
The reason why the connection remains open (in the idle state) is that the Postgres.js driver that we use employs a connection pool. A connection pool acts as an intermediary between the web server and the database server, managing the connections to the database server.
When an application needs to execute a database query, it first requests a connection from the pool. If an idle connection is available, it’s used. Otherwise, as long as the pool has not reached its maximum size, it creates a new physical connection and uses that. The query is then executed over the connection, and once complete, the connection is returned to the pool. This approach saves the overhead of repeatedly opening and closing connections.
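With Postgres.js, the pool can be tuned when creating the sql instance. Below is a minimal sketch; the option names max and idle_timeout come from the Postgres.js documentation, and the values here are illustrative:
import postgres from "postgres";

// At most five pooled connections; idle connections are closed after
// 30 seconds. Connection details are read from the PG* environment
// variables, as in the walking skeleton.
const sql = postgres({
  max: 5,
  idle_timeout: 30,
});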
Pool-based vs non-pool-based connections
Using a connection pool significantly improves the performance of an application because it removes the overhead of repeatedly establishing new database connections. On a local machine with minimal load, the difference might not feel big, but in a production environment, or when handling numerous requests per second, connection pooling has a major impact.
Below is a version of our server-side code (in server/app.js) that uses a connection pool. The connection pool is created when the application starts and is used for all database queries; the pool is closed when the application is shut down.
import { Hono } from "@hono/hono";
import { cors } from "@hono/hono/cors";
import { logger } from "@hono/hono/logger";
import postgres from "postgres";

const app = new Hono();

// A connection pool shared by all requests; Postgres.js reads the
// connection settings from the PG* environment variables.
const sql = postgres();

app.use("/*", cors());
app.use("/*", logger());

app.get("/", (c) => c.json({ message: "Hello world!" }));

app.get("/todos", async (c) => {
  const todos = await sql`SELECT * FROM todos`;
  return c.json(todos);
});

export default app;
When running the application and making requests to it, we see from the server logs that the first query takes a while — as it establishes a connection — but the subsequent queries are faster:
server-1 | Listening on http://0.0.0.0:8000/
server-1 | <-- GET /todos
server-1 | --> GET /todos 200 87ms
server-1 | <-- GET /todos
server-1 | --> GET /todos 200 1ms
server-1 | <-- GET /todos
server-1 | --> GET /todos 200 1ms
server-1 | <-- GET /todos
server-1 | --> GET /todos 200 1ms
server-1 | <-- GET /todos
server-1 | --> GET /todos 200 1ms
The following, on the other hand, establishes a new connection for each request.
import { Hono } from "@hono/hono";
import { cors } from "@hono/hono/cors";
import { logger } from "@hono/hono/logger";
import postgres from "postgres";

const app = new Hono();

app.use("/*", cors());
app.use("/*", logger());

app.get("/", (c) => c.json({ message: "Hello world!" }));

app.get("/todos", async (c) => {
  // Open a new connection for this request only...
  const sql = postgres();
  const todos = await sql`SELECT * FROM todos`;
  // ...and close it before responding.
  await sql.end();
  return c.json(todos);
});

export default app;
Similarly, when running the above code and making requests to the server, we see that each query takes a while, as a new connection is established for each query:
server-1 | Listening on http://0.0.0.0:8000/
server-1 | <-- GET /todos
server-1 | --> GET /todos 200 87ms
server-1 | <-- GET /todos
server-1 | --> GET /todos 200 62ms
server-1 | <-- GET /todos
server-1 | --> GET /todos 200 64ms
server-1 | <-- GET /todos
server-1 | --> GET /todos 200 49ms
server-1 | <-- GET /todos
server-1 | --> GET /todos 200 56ms
server-1 | <-- GET /todos
server-1 | --> GET /todos 200 74ms
In the version that uses a connection pool, each query takes on the order of a millisecond once the connection has been established, while in the version without the pool, each request takes tens of milliseconds. The difference can be even larger in environments where the servers are further apart or under heavier load.