Machine Learning on the Web

Training a Machine Learning Model


Learning Objectives

  • You know how to create a simple machine learning model.

Here, we concretely look into building a simple machine learning model. We will use the scikit-learn library, which is a popular library for machine learning in Python. It provides a wide range of tools for building and evaluating machine learning models, including support for various algorithms, preprocessing techniques, and evaluation metrics.

Although a local Python installation is not mandatory, it can be beneficial. For setup instructions, see Quick Start Guide for Python in VS Code.

Predicting grades

Let’s take our previous example of predicting grades based on coding assignments with data that looks as follows. We can use a supervised learning approach, where we have labeled data (grades) and we want to predict the grade based on the code submission.

Exercise  Code Submission               Grade
1         SELECT * FROM students;       100
1         SELECT * FROM tudents         90
2         SELECT name FROM students;    100
2         SELECT * FROM students        65

We can use a simple linear regression model to predict the grade based on the code submission. The model will learn the relationship between the code submission and the grade, and we can use it to predict the grade for new submissions.

We’ll skip validation and testing, which are crucial when actually evaluating ML models. You’ll get to practice them more in machine learning courses.

Cleaning data

To clean the data, we would, for example, drop rows where the grade is missing or where the code submission is empty. We can also remove duplicates and irrelevant columns. In this case, the data is already clean: we keep the exercise number and the code submission as features, and set the grade column aside as the value to predict. As the starting point, we can use a Pandas DataFrame to represent the data.

import pandas as pd

data = [
  {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
  {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
  {"exercise": 2, "code": "SELECT name FROM students;", "grade": 100},
  {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
]

df = pd.DataFrame(data)
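The cleaning steps described above can be sketched as follows. The rows with a missing grade, an empty submission, and a duplicate are hypothetical additions for illustration; they are not part of the example data used in the rest of this section.

```python
import pandas as pd

# Hypothetical raw data containing the problems described above.
raw = pd.DataFrame([
    {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
    {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},   # duplicate row
    {"exercise": 1, "code": "", "grade": 0},                             # empty submission
    {"exercise": 2, "code": "SELECT name FROM students;", "grade": None},  # missing grade
    {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
])

cleaned = (
    raw.dropna(subset=["grade"])                      # drop rows without a grade
       .loc[lambda d: d["code"].str.strip() != ""]    # drop empty submissions
       .drop_duplicates()                             # drop exact duplicate rows
       .reset_index(drop=True)
)
print(cleaned)
```

Chaining the steps keeps the original `raw` DataFrame untouched, which makes it easier to check how many rows each step removed.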

Preprocessing

Then, during preprocessing, we convert the text-based data into numeric data that can be used by the model. In our case, we can just pick simple n-grams (sequences of n words) from the code submission.

Scikit-learn comes with nice functionality for transforming text data into numeric data. The CountVectorizer class can be used to convert a collection of text documents to a matrix of token counts; it can also be given a ngram_range parameter to specify the range of n-grams to be extracted. For example, if we set ngram_range=(1, 3), it will extract unigrams (single words), bigrams (two-word sequences), and trigrams (three-word sequences) from the text. This allows the model to capture more context and relationships between words, which can improve its performance.

The CountVectorizer is passed to a ColumnTransformer, which allows us to apply different preprocessing steps to different columns of the DataFrame. In this case, we use the CountVectorizer on the code column and leave the exercise column as is. The ColumnTransformer then combines the results into a single feature matrix that can be used for training the model.

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

preprocessor = ColumnTransformer(
  transformers=[
    ('code', CountVectorizer(ngram_range=(1, 3)), 'code'),
    ('exercise', 'passthrough', ['exercise'])
  ]
)
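As a quick check of what the preprocessor produces, the sketch below applies it to the example data on its own, outside any pipeline, and inspects the resulting feature matrix. This step is only for illustration; the pipeline built next performs it automatically.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame([
    {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
    {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
    {"exercise": 2, "code": "SELECT name FROM students;", "grade": 100},
    {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
])

preprocessor = ColumnTransformer(
    transformers=[
        ("code", CountVectorizer(ngram_range=(1, 3)), "code"),
        ("exercise", "passthrough", ["exercise"]),
    ]
)

features = preprocessor.fit_transform(df[["exercise", "code"]])
# The result can be a sparse matrix or a dense array depending on its density.
dense = features.toarray() if hasattr(features, "toarray") else features

print(dense.shape)    # (number of rows, number of n-gram columns + 1)
print(dense[:, -1])   # the last column is the passed-through exercise number
```

Because the default tokenizer ignores `*` and `;`, the first and last submissions yield identical n-gram counts; only the exercise column tells them apart.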

Creating a machine learning pipeline

Then, we can create a Pipeline that takes in the preprocessor and the model. The Pipeline will first apply the preprocessor to the data, and then fit the model to the transformed data.

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
  ('preprocessor', preprocessor),
  ('regressor', LinearRegression())
])

Next, we define the features and the target variable. The features are the exercise column and the code column (which the preprocessor transforms), and the target variable is the grade column.

X = df[['exercise', 'code']]
y = df['grade']

Training the model

Now, all that remains is fitting the model to the data. The Pipeline takes care of applying the preprocessor and fitting the model in one go.

pipeline.fit(X, y)

Once the above has been run, the pipeline holds a fitted model that can be used to make predictions on new data. For example, we can predict the grade for a new code submission by passing the exercise number and the code submission to the pipeline.
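The prediction step described above can be sketched as follows, using the fitted pipeline directly in memory without saving it first. The new submission is a hypothetical input chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

df = pd.DataFrame([
    {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
    {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
    {"exercise": 2, "code": "SELECT name FROM students;", "grade": 100},
    {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
])

pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("code", CountVectorizer(ngram_range=(1, 3)), "code"),
        ("exercise", "passthrough", ["exercise"]),
    ])),
    ("regressor", LinearRegression()),
])
pipeline.fit(df[["exercise", "code"]], df["grade"])

# New data must have the same columns that the pipeline was trained on.
new_submission = pd.DataFrame({
    "exercise": [2],
    "code": ["SELECT * FROM students;"],
})
prediction = pipeline.predict(new_submission)[0]
print("Predicted grade:", prediction)
```

With only four training rows and more features than rows, the model can reproduce the training grades almost exactly, so predictions like this one say little about how it would behave on genuinely new submissions.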

Saving the model

Once the model has been trained, we want to save it. This can be done using the joblib library, which is commonly used for saving and loading machine learning models in Python. The joblib library is particularly efficient for saving large NumPy arrays, which are often used in machine learning models.

import joblib

joblib.dump(pipeline, 'grade_predictor.joblib')

Training code altogether

Put together, the code that we built above looks as follows.

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

data = [
  {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
  {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
  {"exercise": 2, "code": "SELECT name FROM students;", "grade": 100},
  {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
]
df = pd.DataFrame(data)

preprocessor = ColumnTransformer(
  transformers=[
    ('code', CountVectorizer(ngram_range=(1, 3)), 'code'),
    ('exercise', 'passthrough', ['exercise'])
  ]
)

pipeline = Pipeline(steps=[
  ('preprocessor', preprocessor),
  ('regressor', LinearRegression())
])

X = df[['exercise', 'code']]
y = df['grade']

pipeline.fit(X, y)

joblib.dump(pipeline, 'grade_predictor.joblib')

When the above is run, the pipeline will be fitted to the data, and the model will learn the relationship between the exercise number, code submission, and grade. The CountVectorizer will convert the text data into a matrix of token counts, which will be used by the LinearRegression model to make predictions. Finally, the fitted pipeline will be saved to a file called grade_predictor.joblib, which can be loaded later for making predictions on new data.

Using the model

With the above in place, we can take new data and predict the grade for a new code submission. For example, if we have a new code submission for exercise 2, we can load the pipeline and use it to predict the grade.

import pandas as pd
import joblib

pipeline = joblib.load('grade_predictor.joblib')

input_data = pd.DataFrame({
  'exercise': [2],
  'code': ["SELECT * FROM students;"]
})
prediction = pipeline.predict(input_data)[0]
print("Predicted grade:", prediction)

The output of the above would be something like this.

Predicted grade: 65.0

At the same time, if we change the exercise to 1, the prediction would be different.

import pandas as pd
import joblib

pipeline = joblib.load('grade_predictor.joblib')

input_data = pd.DataFrame({
  'exercise': [1],
  'code': ["SELECT * FROM students;"]
})
prediction = pipeline.predict(input_data)[0]
print("Predicted grade:", prediction)

The output of the above would be something like this.

Predicted grade: 99.99999999999999

When performing multiple predictions, we would naturally load the model just once, and keep it in memory for the duration of the program. The above is just to illustrate how the model can be used to make predictions on new data.
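A minimal sketch of loading the model once and reusing it for several predictions is shown below. The helper names `train_and_save` and `predict_grades` are hypothetical, and the model is written to a temporary directory only to keep the example self-contained. Since joblib files are based on pickle, they should only be loaded from sources you trust.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


def train_and_save(path):
    """Train the toy pipeline from this section and save it to `path`."""
    df = pd.DataFrame([
        {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
        {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
        {"exercise": 2, "code": "SELECT name FROM students;", "grade": 100},
        {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
    ])
    pipeline = Pipeline(steps=[
        ("preprocessor", ColumnTransformer(transformers=[
            ("code", CountVectorizer(ngram_range=(1, 3)), "code"),
            ("exercise", "passthrough", ["exercise"]),
        ])),
        ("regressor", LinearRegression()),
    ])
    pipeline.fit(df[["exercise", "code"]], df["grade"])
    joblib.dump(pipeline, path)


def predict_grades(pipeline, submissions):
    """Predict grades for a list of (exercise, code) pairs in a single call."""
    input_data = pd.DataFrame(submissions, columns=["exercise", "code"])
    return pipeline.predict(input_data).tolist()


with tempfile.TemporaryDirectory() as tmp:
    model_path = str(Path(tmp) / "grade_predictor.joblib")
    train_and_save(model_path)
    # Load the model once ...
    pipeline = joblib.load(model_path)

# ... and reuse the in-memory pipeline for every prediction.
grades = predict_grades(pipeline, [
    (1, "SELECT * FROM students;"),
    (2, "SELECT * FROM students;"),
])
print(grades)
```

Passing all submissions in one DataFrame also avoids calling `predict` once per row.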

More on model selection and evaluation

Above, for simplicity, we picked a basic model and did not do any concrete evaluation. For more complex tasks, you would typically evaluate several models and select the best one based on performance metrics. For more on model selection and evaluation, see the scikit-learn documentation.
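As a rough sketch of what comparing models could look like, the example below scores two regressors with leave-one-out cross-validation on the toy data. The choice of Ridge as the second model, its `alpha` value, and the mean absolute error metric are assumptions made for illustration; with only four rows, the resulting numbers are not meaningful and serve only to show the mechanics.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import Pipeline

df = pd.DataFrame([
    {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
    {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
    {"exercise": 2, "code": "SELECT name FROM students;", "grade": 100},
    {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
])
X = df[["exercise", "code"]]
y = df["grade"]


def build_pipeline(regressor):
    """Build the same preprocessing pipeline around a given regressor."""
    return Pipeline(steps=[
        ("preprocessor", ColumnTransformer(transformers=[
            ("code", CountVectorizer(ngram_range=(1, 3)), "code"),
            ("exercise", "passthrough", ["exercise"]),
        ])),
        ("regressor", regressor),
    ])


# Candidate models (hypothetical selection for illustration).
candidates = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}

results = {}
for name, regressor in candidates.items():
    # Each row is held out once; the model is trained on the other three.
    scores = cross_val_score(
        build_pipeline(regressor), X, y,
        cv=LeaveOneOut(), scoring="neg_mean_absolute_error",
    )
    results[name] = -scores.mean()  # mean absolute error, lower is better

print(results)
```

Because the preprocessor is part of the pipeline, it is refitted on the training rows of each split, so the held-out row never influences the vocabulary.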