Creating a Machine Learning API
Learning Objectives
- You know how to deploy a machine learning model to a container and use it in a web application.
Given a machine learning model stored in a file with joblib, we can create a container that runs a server and uses the model to make predictions. For Python, there is a range of web frameworks available for this, such as Flask and FastAPI.
Note that this is a simplified example. In a real-world scenario, you would need to consider additional factors such as security, error handling, and performance optimization.
Creating an inference API
As we haven’t worked with the walking skeleton for a while, it’s good to get back to it. First, in the folder of the walking skeleton, create a folder called inference-api. This will contain the code for our inference API.
The term inference API refers to an API that serves a machine learning model for inference, i.e., making predictions based on input data. It typically exposes endpoints that allow clients to send data to the model and receive predictions in response.
Create three files in the folder: app.py, requirements.txt, and Dockerfile. The app.py file will contain the code for the API, the requirements.txt file specifies the Python dependencies of the project, and the Dockerfile will contain the instructions to build the container.
First, add the following content to the requirements.txt file:
fastapi==0.115.12
joblib==1.4.2
pandas==2.2.3
scikit-learn==1.6.1
uvicorn==0.34.0
The above are commonly used Python libraries for building web applications and machine learning models. The fastapi library is used to create the web API, joblib is used to load the machine learning model, pandas is used for data manipulation, scikit-learn is used for machine learning, and uvicorn is used as the ASGI server that runs the FastAPI application.
Then, add the following code to the app.py file:
from fastapi import FastAPI, Request
import joblib
import pandas as pd

server = FastAPI()
model = joblib.load("model.joblib")

@server.post("/inference-api/predict")
async def predict(request: Request):
    data = await request.json()
    input_data = pd.DataFrame({
        'exercise': [data.get("exercise")],
        'code': [data.get("code")]
    })
    prediction = model.predict(input_data)[0]
    return {"prediction": prediction}
The above creates a FastAPI server with a single endpoint /inference-api/predict that accepts a POST request with JSON data. The data is converted to a pandas DataFrame, which is used as input to the machine learning model. The prediction is returned as a JSON response.
The above API effectively does the same as the example from the previous chapter, but now it runs in a web application and the model is loaded only once, when the server starts. This is a common pattern in machine learning applications: the model is loaded once and then used to make predictions for multiple requests.
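If you want to try the endpoint without Docker, FastAPI’s TestClient can call the application directly in-process. The following is only a quick local sketch, not part of the walking skeleton: it assumes that model.joblib is in the same folder and that the httpx package, which TestClient uses under the hood, is installed.
# A hypothetical local check, not part of the walking skeleton.
# Assumes model.joblib is in the same folder and that httpx is installed.
from fastapi.testclient import TestClient

from app import server  # importing app also loads model.joblib

client = TestClient(server)
response = client.post(
    "/inference-api/predict",
    json={"exercise": 1, "code": "SELECT * FROM secret;"},
)
print(response.status_code, response.json())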
Finally, add the following content to the Dockerfile:
FROM python:3.13-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "app:server", "--host", "0.0.0.0", "--port", "8000"]
Above, we use the python:3.13-slim image as the base image for our container. We then set the working directory to /app, copy the requirements.txt file, and install the dependencies. After that, we copy the rest of the files to the container and run the FastAPI server using uvicorn.
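If you wish to check the image on its own before wiring it into the walking skeleton, you can build and run it directly from the root of the walking skeleton. Note that starting the server requires the model.joblib file that we add in the testing section below, and the image name inference-api used here is just an example.
docker build -t inference-api ./inference-api
docker run --rm -p 8000:8000 inference-api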
Docker Compose
Next, modify the compose.yaml file in the root of the walking skeleton to include the inference API. We’ll configure the inference API to run in a separate container and expose it through Traefik.
Add the following to the compose.yaml file:
inference-api:
  build: inference-api
  restart: unless-stopped
  volumes:
    - ./inference-api:/app
  labels:
    - "traefik.enable=true"
    - "traefik.http.routers.inference-api-router.entrypoints=web"
    - "traefik.http.routers.inference-api-router.rule=PathPrefix(`/inference-api`)"
    - "traefik.http.services.inference-api-service.loadbalancer.server.port=8000"
Testing the inference API
To test the inference API, copy the model that you built in the earlier chapter to the inference-api folder and name it model.joblib. This is the model that the inference API will use to make predictions.
You may alternatively build the model in the inference API container, but this is not recommended for production use. In production, you would typically build the model in a separate container and then copy it to the inference API container.
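As a reminder of how such a file can be produced, the sketch below trains a small placeholder pipeline and saves it with joblib.dump. The pipeline and data are purely illustrative and merely stand in for whatever model you trained in the earlier chapter; the relevant points are that the saved model accepts a DataFrame with exercise and code columns and that it is written out with joblib.dump.
# Illustrative only: a placeholder pipeline standing in for the model from
# the earlier chapter. The relevant part is the joblib.dump call at the end.
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Tiny made-up training data with the same columns the API sends to the model.
train = pd.DataFrame({
    "exercise": [1, 1, 2],
    "code": ["SELECT 1;", "SELECT * FROM secret;", "SELECT name FROM t;"],
})
target = [10.0, 95.0, 50.0]

model = Pipeline([
    ("features", ColumnTransformer(
        [("code", TfidfVectorizer(), "code")],
        remainder="passthrough",
    )),
    ("regressor", LinearRegression()),
])
model.fit(train, target)

# Save the trained model so the inference API can load it with joblib.load.
joblib.dump(model, "model.joblib")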
Then, start the walking skeleton by running the following command in the root of the walking skeleton:
docker compose up --build
Now, we can test the inference API using curl or any other HTTP client. For example, we can use the following command to send a POST request to the /inference-api/predict endpoint with a sample input:
curl -X POST -d '{"exercise": 1, "code": "SELECT * FROM secret;"}' localhost:8000/inference-api/predict
{"prediction":94.99999999999997}
If the server responds with an error, check the logs and make sure that the model is loaded correctly and that the input data is in the correct format.
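The same request can also be made programmatically, for example from another service in the walking skeleton. The sketch below uses only the Python standard library; the URL is an assumption, so adjust it to match how the API is exposed in your setup.
# Sketch: call the inference API from Python using only the standard library.
# The URL is an assumption -- use whatever address the API is reachable at.
import json
import urllib.request

payload = json.dumps({"exercise": 1, "code": "SELECT * FROM secret;"}).encode()
request = urllib.request.Request(
    "http://localhost:8000/inference-api/predict",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.load(response))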
If you built the model with a different version of the libraries, you may need to rebuild it with the same versions as the ones used in the inference API container. This is important because a model serialized with one version of a library is not always compatible with another version.
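A quick way to compare the two environments is to print the relevant library versions both where the model is trained and inside the container, for example:
# Print the versions of the libraries involved in serializing the model,
# so the training environment and the container can be kept in sync.
import joblib
import sklearn

print("scikit-learn:", sklearn.__version__)
print("joblib:", joblib.__version__)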
Scaling the inference API
The inference API that we built above is stateless, which means that we can scale it horizontally by running multiple instances of the container and using a load balancer to route requests to them. This holds for practically all inference APIs, as they just host a model, pass requests to it, and return the predictions.
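With the compose setup above, this is easy to try out locally: Traefik load-balances between the containers of the service, and Docker Compose can start several replicas with the --scale flag. For example, the following would start three instances of the inference API:
docker compose up --build --scale inference-api=3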
The usual approaches that we’ve seen for scaling web applications also apply here — for example, we could add a cache in front of the inference API to cache the predictions and reduce the load on the model, or we could use a message queue to handle the requests asynchronously and process them in the background, returning the results later. Similarly, we could use Kubernetes to manage the containers and automatically scale them based on the load.
There are also plenty of dedicated tools for the task, such as Kubeflow and MLflow, which provide more complete solutions for deploying and managing machine learning models in production. Tools like Cog can also make the creation of containers easier, as they provide a simple interface for building and deploying machine learning models in containers. These are, however, out of the scope of this course.