Machine Learning on the Web

Creating a Machine Learning API


Learning Objectives

  • You know how to deploy a machine learning model to a container and use it in a web application.

Given a machine learning model stored in a file with joblib, we can create a container that runs a server and uses the model to make predictions. For Python, there is a range of web frameworks available for this, such as Flask and FastAPI.

Note that this is a simplified example. In a real-world scenario, you would need to consider additional factors such as security, error handling, and performance optimization.

Creating an inference API

As we haven’t worked with the walking skeleton for a while, it’s good to get back to it. First, in the folder of the walking skeleton, create a folder called inference-api. This will contain the code for our inference API.

The term inference API refers to an API that serves a machine learning model for inference, i.e., making predictions based on input data. It typically exposes endpoints that allow clients to send data to the model and receive predictions in response.

Create three files in the folder: app.py, requirements.txt, and Dockerfile. The app.py file will contain the code for the API, the requirements.txt file specifies the Python dependencies for the project, and the Dockerfile will contain the instructions for building the container.

First, add the following content to the requirements.txt file:

fastapi==0.115.12
joblib==1.4.2
pandas==2.2.3
scikit-learn==1.6.1
uvicorn==0.34.0

Each of the above is a commonly used Python library for building web applications and machine learning models. The fastapi library is used to create the web API, joblib is used to load the machine learning model, pandas is used for data manipulation, scikit-learn is used for machine learning, and uvicorn is used as the ASGI server that runs the FastAPI application.

Then, add the following code to the app.py file:

from fastapi import FastAPI, Request
import joblib
import pandas as pd

server = FastAPI()

# Load the model once when the server starts; it is then reused for all requests.
model = joblib.load("model.joblib")

@server.post("/inference-api/predict")
async def predict(request: Request):
  # Read the JSON body of the request.
  data = await request.json()

  # Wrap the input in a DataFrame with the columns the model was trained on.
  input_data = pd.DataFrame({
    'exercise': [data.get("exercise")],
    'code': [data.get("code")]
  })

  prediction = model.predict(input_data)[0]

  return {"prediction": prediction}

The above creates a FastAPI server with a single endpoint /inference-api/predict that accepts a POST request with JSON data. The data is then converted to a Pandas DataFrame, which is used as input to the machine learning model. The prediction is returned as a JSON response.

The above API effectively does the same as the example from the previous chapter, but now it is run in a web application and the model is loaded only at the start of the server. This is a common pattern in machine learning applications, where the model is loaded once and used to make predictions for multiple requests.
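
If you want to check the endpoint before building the container, you can exercise it in-process with FastAPI's test client. The following is a minimal sketch, not part of the walking skeleton: it assumes that model.joblib is already present next to app.py and that pytest and the httpx package (required by the test client) are installed in addition to the dependencies in requirements.txt.

# test_app.py -- a quick in-process check of the prediction endpoint
from fastapi.testclient import TestClient

from app import server

client = TestClient(server)

def test_predict_returns_a_prediction():
  response = client.post(
    "/inference-api/predict",
    json={"exercise": 1, "code": "SELECT * FROM secret;"}
  )
  assert response.status_code == 200
  assert "prediction" in response.json()

Running pytest in the inference-api folder sends the request directly to the application without starting a server.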

Finally, add the following content to the Dockerfile file:

FROM python:3.13-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["uvicorn", "app:server", "--host", "0.0.0.0", "--port", "8000"]

Above, we use the python:3.13-slim image as the base image for our container. We then set the working directory to /app, copy the requirements.txt file, and install the dependencies. After that, we copy the rest of the files to the container and run the FastAPI server using uvicorn.

Docker Compose

Next, modify the compose.yaml file in the root of the walking skeleton to include the inference API. We’ll configure the inference API to run in a separate container and expose it through Traefik.

Add the following to the compose.yaml file:

  inference-api:
    build: inference-api
    restart: unless-stopped
    volumes:
      - ./inference-api:/app
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.inference-api-router.entrypoints=web"
      - "traefik.http.routers.inference-api-router.rule=PathPrefix(`/inference-api`)"
      - "traefik.http.services.inference-api-service.loadbalancer.server.port=8000"

Testing the inference API

To test the inference API, copy the model that you built in the earlier chapter to the inference-api folder and name it model.joblib. This is the model that the inference API uses to make predictions.

You may alternatively build the model in the inference API container, but this is not recommended for production use. In production, you would typically build the model in a separate container and then copy it to the inference API container.
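
If you don't have the earlier model at hand, you can generate a stand-in model purely for smoke-testing the API. The script below is a hedged sketch rather than the model from the earlier chapter: the training data and the pipeline are placeholders, and the only real assumption is that the model accepts a DataFrame with exercise and code columns and returns a number.

# make_placeholder_model.py -- produces a throwaway model.joblib for testing
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Tiny placeholder training data with the same columns the API sends.
data = pd.DataFrame({
  'exercise': [1, 1, 2],
  'code': ["SELECT * FROM secret;", "SELECT id FROM secret;", "SELECT 1;"]
})
scores = [95, 90, 40]

# Vectorize the code column and pass the exercise column through unchanged.
pipeline = Pipeline([
  ("features", ColumnTransformer(
    [("code", TfidfVectorizer(), "code")],
    remainder="passthrough"
  )),
  ("regressor", LinearRegression())
])

pipeline.fit(data, scores)
joblib.dump(pipeline, "model.joblib")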

Then, start the walking skeleton by running the following command in the root of the walking skeleton:

docker compose up --build

Now, we can test the inference API using curl or any other HTTP client. For example, we can use the following command to send a POST request to the /inference-api/predict endpoint with a sample input:

curl -X POST -d '{"exercise": 1, "code": "SELECT * FROM secret;"}' localhost:8000/inference-api/predict
{"prediction":94.99999999999997}

If the server responds with an error, check the logs and make sure that the model is loaded correctly and that the input data is in the correct format.

If you built the model with different library versions than the ones installed in the inference API container, you may need to rebuild the model with matching versions. This is important because a model serialized with one version of a library is not guaranteed to load or behave correctly with another.
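
A quick way to spot a mismatch is to print the relevant library versions both in the environment where the model was trained and inside the container. The snippet below is just an illustration and not part of the walking skeleton.

# check_versions.py -- print the versions that matter when loading the model
import joblib
import sklearn

print("scikit-learn:", sklearn.__version__)
print("joblib:", joblib.__version__)

Inside the running container, the script can be executed with docker compose exec inference-api python check_versions.py.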

Scaling the inference API

The inference API that we built above is stateless, which means that we can scale it horizontally by running multiple instances of the container and using a load balancer to route requests to them. This holds for practically all inference APIs, as they simply host a model, pass requests to it, and return the predictions.

The usual approaches that we’ve seen for scaling web applications also apply here — for example, we could add a cache in front of the inference API to cache the predictions and reduce the load on the model, or we could use a message queue to handle the requests asynchronously and process them in the background, returning the results later. Similarly, we could use Kubernetes to manage the containers and automatically scale them based on the load.
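
As a concrete illustration of the caching idea, the sketch below caches predictions inside the API process itself. It assumes that the model is static, so identical inputs always produce the same prediction; in a real deployment, a shared cache such as Redis in front of the instances would be a better fit than a per-process cache.

from functools import lru_cache

from fastapi import FastAPI, Request
import joblib
import pandas as pd

server = FastAPI()

model = joblib.load("model.joblib")

# Cache recent predictions; repeated inputs skip the model entirely.
@lru_cache(maxsize=1024)
def cached_predict(exercise, code):
  input_data = pd.DataFrame({
    'exercise': [exercise],
    'code': [code]
  })
  return float(model.predict(input_data)[0])

@server.post("/inference-api/predict")
async def predict(request: Request):
  data = await request.json()
  prediction = cached_predict(data.get("exercise"), data.get("code"))
  return {"prediction": prediction}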

There are plenty of dedicated tools for the task, such as Kubeflow and MLflow, which provide more complete solutions for deploying and managing machine learning models in production. Tools like Cog can also make building the containers easier, as they provide a simple interface for packaging machine learning models into containers. These are, however, out of scope for this course.