Machine Learning on the Web

Machine Learning Overview


Learning Objectives

  • You know of machine learning and can distinguish between different types of machine learning and their common applications.

Machine learning

Machine learning (ML) is a branch of artificial intelligence (AI) that enables software systems to learn patterns from data, make predictions, or perform actions without explicit programming for specific cases. ML is increasingly integrated into web applications, enhancing functionalities such as personalized recommendations, search, image and speech recognition, and automated decision-making.

For a brief overview, check out the Introduction to Large Language Models course. As an Aalto student, you may also take CS-C3240 Machine Learning and explore the offerings in the comprehensive degree programmes such as the MSc in Machine Learning, Data Science, and Artificial Intelligence.

Learning from data

Many machine learning tasks involve building a model, which is a function that maps inputs (data) to outputs (predictions or decisions). Building a model requires relevant and high-quality data. Once the model has been trained, it can be used to make predictions on new, unseen data.

The model’s performance is evaluated based on how accurately it predicts the outcomes for this new data.

The training data should be representative of the problem we want to solve and contain enough examples to learn from. The more diverse and comprehensive the data, the better the model can typically generalize to new situations.

As data is often messy and incomplete, it typically needs cleaning and preprocessing. Cleaning involves removing missing values, handling outliers, and correcting inconsistencies, while preprocessing involves transforming the data into a suitable format for the model. Preprocessing may include, for example, normalizing numerical values, encoding text or categorical variables into numeric values, and scaling values to a common range.
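
The preprocessing steps named above can be sketched in plain Python. This is a minimal illustration, not a prescribed method: the function names and the `languages` column are hypothetical, and in practice a library would usually handle these steps.

```python
def min_max_scale(values):
    """Scale numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(categories):
    """Encode each category as a list with a single 1 at the category's index."""
    vocabulary = sorted(set(categories))
    index = {category: i for i, category in enumerate(vocabulary)}
    return [
        [1 if index[c] == i else 0 for i in range(len(vocabulary))]
        for c in categories
    ]


grades = [100, 90, 65]
languages = ["sql", "python", "sql"]  # hypothetical categorical column

scaled_grades = min_max_scale(grades)
encoded_languages = one_hot_encode(languages)
```

After scaling, the highest grade maps to 1.0 and the lowest to 0.0, so the values share a common range regardless of their original units.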

For example, consider a simple dataset of coding assignments and grades:

Exercise | Code Submission            | Grade
1        | SELECT * FROM students;    | 100
1        | SELECT * FROM tudents      | 90
2        | SELECT name FROM students; | 100
2        | null                       | null
2        | SELECT * FROM students     | 65
2        | SELECT * FROM students     | null

In the above dataset, there are two rows with missing values (the fourth and last rows). We can clean the data by removing these rows, as they do not provide useful information for training the model. After cleaning, we have:

Exercise | Code Submission            | Grade
1        | SELECT * FROM students;    | 100
1        | SELECT * FROM tudents      | 90
2        | SELECT name FROM students; | 100
2        | SELECT * FROM students     | 65
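
The cleaning step can be sketched as follows. This is one possible approach under stated assumptions: the dictionary keys are hypothetical names for the three columns, and `None` stands in for the null entries of the dataset.

```python
# The example dataset, with None standing in for missing (null) values.
rows = [
    {"exercise": 1, "code": "SELECT * FROM students;", "grade": 100},
    {"exercise": 1, "code": "SELECT * FROM tudents", "grade": 90},
    {"exercise": 2, "code": "SELECT name FROM students;", "grade": 100},
    {"exercise": 2, "code": None, "grade": None},
    {"exercise": 2, "code": "SELECT * FROM students", "grade": 65},
    {"exercise": 2, "code": "SELECT * FROM students", "grade": None},
]


def drop_incomplete(rows):
    """Keep only the rows in which every column has a value."""
    return [
        row for row in rows
        if all(value is not None for value in row.values())
    ]


cleaned = drop_incomplete(rows)
```

Dropping rows is the simplest option; depending on the problem, filling in missing values may be preferable to discarding data.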

Next, we need to preprocess the data. For example, we need to transform the text in the “Code Submission” column into a vector representation, which is a numerical representation of the text. This is important because machine learning models typically work with numerical data. There is a range of techniques for this, including n-grams, where each contiguous sequence of n words (or characters) is treated as a separate feature.

N-grams are discussed in a bit more detail in the chapter Word n-gram Language Models of the course Introduction to Large Language Models.
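
As a rough sketch of the idea, the following turns the cleaned code submissions into count vectors over word bigrams (n = 2). Splitting on whitespace and the function names `word_ngrams` and `vectorize` are assumptions made for illustration; real tokenizers are usually more careful.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the contiguous sequences of n whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(texts, n):
    """Map each text to a vector of n-gram counts over a shared vocabulary."""
    counts = [Counter(word_ngrams(text, n)) for text in texts]
    vocabulary = sorted(set().union(*counts))
    vectors = [[count[gram] for gram in vocabulary] for count in counts]
    return vocabulary, vectors


submissions = [
    "SELECT * FROM students;",
    "SELECT * FROM tudents",
    "SELECT name FROM students;",
    "SELECT * FROM students",
]

vocabulary, vectors = vectorize(submissions, 2)
```

Each position in a vector corresponds to one bigram in `vocabulary`, so submissions that share bigrams end up with similar vectors.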


Types of machine learning

There are three primary categories of machine learning:

  • Supervised Learning: Models are trained using labeled data, where each example has an associated correct output. Common examples include predicting grades, detecting spam emails, or identifying images.

  • Unsupervised Learning: Models analyze unlabeled data to find hidden patterns or groupings (clusters). Examples include finding frequent patterns in customer transactions, clustering similar items, or reducing dimensionality of data.

  • Reinforcement Learning: Models learn by interacting with their environment, optimizing actions based on feedback (rewards). Examples include developing agents for game playing and improving robotic control systems.

There are also hybrid approaches, such as semi-supervised learning, which combines labeled and unlabeled data, and transfer learning, where knowledge from one task is applied to another.

For each type of machine learning, there are various algorithms and techniques that can be used. For example, supervised learning can use algorithms like linear regression, decision trees, or support vector machines, while unsupervised learning can use clustering algorithms like k-means or hierarchical clustering.
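
To make the supervised case concrete, here is a minimal sketch of linear regression with a single input, fitted using the closed-form least-squares solution. The data (hours of practice and grades) is hypothetical and chosen only so that the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical labeled data: hours of practice (input) and grade (output).
hours = [1.0, 2.0, 3.0, 4.0]
grades = [55.0, 65.0, 75.0, 85.0]

slope, intercept = fit_line(hours, grades)
predicted_grade = slope * 5.0 + intercept  # prediction for a new, unseen input
```

The labels (grades) are what make this supervised: the fitted line is chosen to agree with the known outputs as closely as possible.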

The choice of model depends on the specific application, data availability, and desired accuracy.
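
On the unsupervised side, k-means can be sketched in a few lines for one-dimensional data. This is a simplified illustration under stated assumptions: the points and the initial centroids are hypothetical, and the initial centroids are given explicitly rather than chosen at random.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: abs(point - centroids[i]),
            )
            clusters[nearest].append(point)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data: grades without any group information.
points = [60, 62, 65, 90, 95, 100]
centroids, clusters = k_means_1d(points, centroids=[60, 100])
```

No labels are involved here: the two groups emerge only from how close the values are to each other.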

Training and evaluating ML models

When training a machine learning model, we typically start with a dataset that contains examples of the problem we want to solve. The model learns from this data by adjusting its parameters to minimize the error in its predictions. This process is called training.
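
The idea of adjusting parameters to reduce error can be illustrated with gradient descent on a one-input linear model. This is a sketch of one common training procedure, not the only one; the data, learning rate, and number of epochs are assumptions chosen for the example.

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors with the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Move the parameters a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = train(xs, ys)
```

With this data, the parameters approach w = 2 and b = 1, the values that generated the examples.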

To detect overfitting, where the model fits the training data so closely that it performs poorly on new data, we evaluate the model’s performance on a separate dataset. This is called validation. The validation dataset is used to tune the model’s settings (hyperparameters) and to choose between candidate models.

Finally, we need to evaluate the model’s performance on a third dataset, called the test dataset. The test dataset is used to assess how well the model generalizes to new, unseen data. This is important because the ultimate goal of machine learning is to create models that can make accurate predictions on new data, not just the data they were trained on.
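
Splitting a dataset into training, validation, and test sets can be sketched as follows. The 60/20/20 proportions and the fixed seed are assumptions for illustration; suitable proportions depend on how much data is available.

```python
import random


def train_validation_test_split(examples, validation_fraction=0.2,
                                test_fraction=0.2, seed=42):
    """Shuffle the examples and split them into three disjoint sets."""
    shuffled = list(examples)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


examples = list(range(10))  # stand-in for ten data points
train, validation, test = train_validation_test_split(examples)
```

Shuffling before splitting reduces the risk that the ordering of the data, for example by exercise number, leaks into one of the sets.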

The concrete evaluation is based on an objective function (or evaluation metric) chosen for the task. For example, if we are building a classification model, we might use accuracy, the fraction of predictions that are correct, and aim to maximize it. If we are building a regression model, we might use mean squared error (MSE), the average squared difference between the predicted and actual values, and aim to minimize it.
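
Both measures are short enough to write out directly. The predictions and actual values below are hypothetical and serve only to show the arithmetic.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def mean_squared_error(predicted, actual):
    """Average squared difference between predicted and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Classification: three of four hypothetical predictions are correct.
spam_accuracy = accuracy(
    ["spam", "ham", "spam", "ham"],
    ["spam", "ham", "ham", "ham"],
)

# Regression: squared errors are 100, 25, and 0 for hypothetical grades.
grade_mse = mean_squared_error([90.0, 70.0, 100.0], [100.0, 65.0, 100.0])
```

Because the errors are squared, MSE penalizes one large mistake more heavily than several small ones.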

Much of the training and evaluation process is automated by libraries and frameworks, which handle the underlying mathematics and optimization algorithms. These libraries are mainly used via Python, which is the most popular language for machine learning. One of the most popular libraries is scikit-learn, which provides a wide range of algorithms for supervised and unsupervised learning. There are plenty of other options as well, including TensorFlow and PyTorch, which are popular for deep learning tasks, and Hugging Face for natural language processing tasks and more.
