Training a Machine Learning Model
Learning Objectives
- You know how to create a simple machine learning model.
Here, we look concretely at building a simple machine learning model. We will use scikit-learn, a popular Python library for machine learning. It provides a wide range of tools for building and evaluating machine learning models, including support for various algorithms, preprocessing techniques, and evaluation metrics.
A local installation of Python is not mandatory, but it can be beneficial. See Quick Start Guide for Python in VS Code for setup instructions.
Predicting grades
Let’s take our previous example of predicting grades based on coding assignments with data that looks as follows. We can use a supervised learning approach, where we have labeled data (grades) and we want to predict the grade based on the code submission.
Exercise | Code Submission | Grade |
---|---|---|
1 | SELECT * FROM students; | 100 |
1 | SELECT * FROM tudents | 90 |
2 | SELECT name FROM students; | 100 |
2 | SELECT * FROM students | 65 |
We can use a simple linear regression model to predict the grade based on the code submission. The model will learn the relationship between the code submission and the grade, and we can use it to predict the grade for new submissions.
We’ll skip validation and testing, which are crucial when evaluating ML models in practice. You’ll get to practice them more in machine learning courses.
Cleaning data
To clean the data, we would, for example, drop rows where the grade is missing or where the code submission is empty. We can also remove any duplicates or irrelevant columns. In this case, we keep the exercise number and the code submission as features, and set the grade column aside as the prediction target. As the starting point, we can use a Pandas DataFrame to represent the data.
import pandas as pd

data = [
    {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
    {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
    {"exercise": 2, "code": "SELECT name FROM students;", "grade": 100},
    {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
]
df = pd.DataFrame(data)
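The cleaning steps described above can be sketched with Pandas roughly as follows. This is a minimal sketch on a hypothetical `raw` DataFrame: the duplicate row, the missing grade, and the empty submission are made up for illustration and are not part of the example data.

```python
import pandas as pd

# Hypothetical raw data: the first two rows come from the example,
# the remaining rows are invented to illustrate the cleaning steps.
raw = pd.DataFrame([
    {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
    {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
    {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},          # duplicate
    {"exercise": 2, "code": "SELECT name FROM students;", "grade": None},   # missing grade
    {"exercise": 2, "code": "   ", "grade": 0},                             # empty submission
])

cleaned = (
    raw
    .dropna(subset=["grade"])                      # drop rows without a grade
    .loc[lambda d: d["code"].str.strip() != ""]    # drop empty submissions
    .drop_duplicates()                             # drop exact duplicate rows
    .reset_index(drop=True)
)

print(cleaned)
```

The example data in this chapter is already clean, so these steps would leave it unchanged.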
Preprocessing
Then, during preprocessing, we convert the text-based data into numeric data that can be used by the model. In our case, we can just pick simple n-grams (sequences of n words) from the code submission.
Scikit-learn comes with nice functionality for transforming text data into numeric data. The CountVectorizer class can be used to convert a collection of text documents to a matrix of token counts; it can also be given an ngram_range parameter to specify the range of n-grams to be extracted. For example, if we set ngram_range=(1, 3), it will extract unigrams (single words), bigrams (two-word sequences), and trigrams (three-word sequences) from the text. This allows the model to capture more context and relationships between words, which can improve its performance.
The CountVectorizer is given to a ColumnTransformer, which allows us to apply different preprocessing steps to different columns of the DataFrame. In this case, we can use the CountVectorizer on the code column and leave the exercise column as is. The ColumnTransformer will then combine the results into a single feature matrix that can be used for training the model.
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

preprocessor = ColumnTransformer(
    transformers=[
        ('code', CountVectorizer(ngram_range=(1, 3)), 'code'),
        ('exercise', 'passthrough', ['exercise'])
    ]
)
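As a quick check of what the ColumnTransformer produces, the sketch below applies the preprocessor to the example data and prints the shape of the combined feature matrix. The variable names `features` and `n_code_columns` are introduced only for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame([
    {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
    {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
    {"exercise": 2, "code": "SELECT name FROM students;", "grade": 100},
    {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
])

preprocessor = ColumnTransformer(
    transformers=[
        ("code", CountVectorizer(ngram_range=(1, 3)), "code"),
        ("exercise", "passthrough", ["exercise"]),
    ]
)

# Fit the preprocessor and transform the two feature columns.
features = preprocessor.fit_transform(df[["exercise", "code"]])

# Number of n-gram columns learned by the fitted CountVectorizer.
n_code_columns = len(preprocessor.named_transformers_["code"].vocabulary_)

print(features.shape)   # (rows, n-gram columns + 1 exercise column)
print(n_code_columns)
```

Each row of the matrix corresponds to one submission: the n-gram counts come first, followed by the untouched exercise number.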
Creating a machine learning pipeline
Then, we can create a Pipeline that takes in the preprocessor and the model. The Pipeline will first apply the preprocessor to the data, and then fit the model to the transformed data.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])
Finally, we define the features and the target variable. The features are the exercise column and the code column (which the preprocessor transforms), and the target variable is the grade column.
X = df[['exercise', 'code']]
y = df['grade']
Training the model
Now, all that remains is fitting the model to the data. The Pipeline takes care of applying the preprocessor and fitting the model in one go.
pipeline.fit(X, y)
Once the above has been run, the pipeline contains a fitted model that can be used to make predictions on new data. For example, we can predict the grade for a new code submission by passing the exercise number and the code submission to the pipeline.
Saving the model
Once the model has been trained, we want to save it. This can be done using the joblib library, which is commonly used for saving and loading machine learning models in Python. The joblib library is particularly efficient for saving large NumPy arrays, which are often used in machine learning models.
import joblib
joblib.dump(pipeline, 'grade_predictor.joblib')
Training code altogether
Together, the code that we built above would look as follows.
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Training data: exercise number, code submission, and grade
data = [
    {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
    {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
    {"exercise": 2, "code": "SELECT name FROM students;", "grade": 100},
    {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
]
df = pd.DataFrame(data)

# Turn the code into n-gram counts; keep the exercise number as is
preprocessor = ColumnTransformer(
    transformers=[
        ('code', CountVectorizer(ngram_range=(1, 3)), 'code'),
        ('exercise', 'passthrough', ['exercise'])
    ]
)

# Chain preprocessing and the regression model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Features and target
X = df[['exercise', 'code']]
y = df['grade']

# Fit the pipeline and save it to a file
pipeline.fit(X, y)
joblib.dump(pipeline, 'grade_predictor.joblib')
When the above is run, the pipeline will be fitted to the data, and the model will learn the relationship between the exercise number, code submission, and grade. The CountVectorizer will convert the text data into a matrix of token counts, which will be used by the LinearRegression model to make predictions. Finally, the fitted pipeline will be saved to a file called grade_predictor.joblib, which can be loaded later for making predictions on new data.
Using the model
With the above in place, we can take new data and predict the grade for a new code submission. For example, if we have a new code submission for exercise 2, we can load the pipeline and use it to predict the grade.
import pandas as pd
import joblib

pipeline = joblib.load('grade_predictor.joblib')

input_data = pd.DataFrame({
    'exercise': [2],
    'code': ["SELECT * FROM students;"]
})

prediction = pipeline.predict(input_data)[0]
print("Predicted grade:", prediction)
The output of the above would be something like this.
Predicted grade: 65.0
At the same time, if we change the exercise to 1, the prediction would be different.
import pandas as pd
import joblib

pipeline = joblib.load('grade_predictor.joblib')

input_data = pd.DataFrame({
    'exercise': [1],
    'code': ["SELECT * FROM students;"]
})

prediction = pipeline.predict(input_data)[0]
print("Predicted grade:", prediction)
The output of the above would be something like this.
Predicted grade: 99.99999999999999
When performing multiple predictions, we would naturally load the model just once, and keep it in memory for the duration of the program. The above is just to illustrate how the model can be used to make predictions on new data.
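Loading the model once and reusing it could look roughly like the sketch below. To keep the example self-contained, it trains and saves the pipeline into a temporary directory first; the helper `predict_grades` is a hypothetical name introduced for this illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Train and save the pipeline, as in the training code above.
df = pd.DataFrame([
    {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
    {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
    {"exercise": 2, "code": "SELECT name FROM students;", "grade": 100},
    {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
])
pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("code", CountVectorizer(ngram_range=(1, 3)), "code"),
        ("exercise", "passthrough", ["exercise"]),
    ])),
    ("regressor", LinearRegression()),
])
pipeline.fit(df[["exercise", "code"]], df["grade"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "grade_predictor.joblib")
    joblib.dump(pipeline, path)
    model = joblib.load(path)  # load once, keep in memory


def predict_grades(model, submissions):
    """Hypothetical helper: predict grades for a list of submissions."""
    return model.predict(pd.DataFrame(submissions))


# Several submissions are predicted with the same in-memory model.
predictions = predict_grades(model, [
    {"exercise": 1, "code": "SELECT * FROM students;"},
    {"exercise": 2, "code": "SELECT * FROM students;"},
])
print(predictions)
```

Passing all submissions in one DataFrame also lets the pipeline handle them in a single predict call instead of one call per submission.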
Above, for simplicity, we picked a basic model and did not do any concrete evaluation. For more complex tasks, you would typically want to evaluate different models and select the best one based on performance metrics. For more on model selection and evaluation, see the scikit-learn documentation.
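As a rough idea of what comparing models could look like, the sketch below scores two candidate regressors with cross-validation. The choice of Ridge as the second candidate, the mean absolute error metric, and the helper `build_pipeline` are assumptions made for this illustration. With only four rows, the resulting numbers say very little; a real comparison would need far more data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

df = pd.DataFrame([
    {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
    {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
    {"exercise": 2, "code": "SELECT name FROM students;", "grade": 100},
    {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
])
X = df[["exercise", "code"]]
y = df["grade"]


def build_pipeline(regressor):
    """Hypothetical helper: same preprocessing, exchangeable model."""
    preprocessor = ColumnTransformer(transformers=[
        ("code", CountVectorizer(ngram_range=(1, 3)), "code"),
        ("exercise", "passthrough", ["exercise"]),
    ])
    return Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("regressor", regressor),
    ])


candidates = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}

results = {}
for name, regressor in candidates.items():
    # 2-fold cross-validation; scores are negated mean absolute errors.
    scores = cross_val_score(
        build_pipeline(regressor), X, y,
        cv=2, scoring="neg_mean_absolute_error",
    )
    results[name] = -scores.mean()

# Lower mean absolute error is better.
print(results)
```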