What is machine learning?
- Nature
- Not a single approach but a wide array of techniques
- Foundations
- Probability theory, Statistics, Linear algebra, Optimization, Control theory
- Applications
- Classification, Regression, Clustering, Density estimation, …
- Related Fields
- Probability theory, Statistical inference, Computational statistics, Data mining, Data science, Computer vision, Artificial intelligence, Optimization theory, …
What is machine learning?
- Pattern recognition
- Identifying and understanding patterns in the data
- Improvement over time
- Systems perform better as more data becomes available
- Fewer manual interventions required as systems learn
- Data-driven decisions
- Making decisions based on analyzed data rather than pre-defined rules
- Generalization
- Applying learned knowledge to unseen data or scenarios
- Not just memorizing but understanding patterns
- Adaptation
- Changing behavior in response to changing environments or data trends
Main Types of Learning
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
- Self-supervised Learning
Main Types of Learning
- Supervised Learning
- Use labeled data to learn how to make predictions.
- E.g., predicting tomorrow's stock price, predicting the number of sales in a particular month
- Unsupervised Learning
- Use unlabeled data to discover patterns
- E.g., clustering similar types of films or songs
- Reinforcement Learning
- Learning through trial and error to maximize rewards
- E.g., self-driving cars, stock-trading bots
- Self-supervised Learning
- Use modified versions of the data itself as labels
- E.g., language models, generative models
This course will focus on supervised learning, where the machine is supervised by labeled data, much like teaching a child with examples: every piece of data comes with a correct answer.
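The idea of labeled data can be sketched in a few lines: every input is paired with a correct answer. The feature (temperature) and label (bikes rented) below are illustrative, echoing the bike-rental example used later in these slides.

```python
# A minimal sketch of supervised learning data: every input x comes
# paired with a correct answer y (the label). Numbers are illustrative.
temperatures = [15.0, 22.0, 8.0, 30.0]   # inputs x: temperature in degrees C
bikes_rented = [120, 250, 60, 310]       # labels y: number of bikes rented

# The learner is shown (x, y) pairs, like a child shown worked examples.
dataset = list(zip(temperatures, bikes_rented))
print(dataset[0])  # (15.0, 120)
```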
Definition
A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$. (Mitchell, 1997)
Three ingredients:
- Experience $E$
- Class of tasks $T$
- Performance $P$
Definition
- Experience $E$
- $\mathcal{X}$ - input space (measurement space, feature space)
- $\mathcal{Y}$ - output space (label space, response space)
- Class of tasks $T$
- determine a true function $f : \mathcal{X} \to \mathcal{Y}$
- Performance $P$
- reward or utility function (its negative is a loss function)
Examples
- Experience $E$
- $\mathcal{X}$ - temperature, season, sunshine duration
- $\mathcal{Y}$ - number of bikes rented
- Class of tasks $T$
- determine a function which takes as input the temperature, the season, and sunshine duration to predict the number of bikes rented
- Performance $P$
Examples
- Experience $E$
- $\mathcal{X}$ - market news, analyst ratings, past price
- $\mathcal{Y}$ - future stock price
- Class of tasks $T$
- determine a function which takes as input market news, analyst ratings, and past stock prices to predict future stock prices
- Performance $P$
Examples
- Experience $E$
- $\mathcal{X}$ - images of dogs or cats
- $\mathcal{Y}$ - "dog" or "cat"
- Class of tasks $T$
- determine a function which takes an image as input and assesses whether the image is of a cat or a dog
- Performance $P$
Experience
In supervised learning, experience is straightforward: a dataset of features and matching labels, e.g.,
- $\mathcal{X}$ - temperature, season, sunshine duration
- $\mathcal{Y}$ - number of bikes rented
- $\mathcal{X}$ - market news, analyst ratings, past price
- $\mathcal{Y}$ - future stock price
- $\mathcal{X}$ - images of dogs or cats
- $\mathcal{Y}$ - "dog" or "cat"
Classes of tasks $T$
Do the following examples belong to the same class of tasks?
- Assessing if an image is of a cat or a dog
- Predicting the number of bike sales
- Predicting future stock prices
- Predicting whether the price of stock is going to go up or down
- Predicting whether a creditor will default
- Predicting the number of passengers in a given flight
- Predicting the age of a viewer of an online video
Classes of tasks $T$
- Regression
- Predicting future stock prices
- Predicting the number of bike sales
- Predicting the age of a viewer of an online video
- Predicting the number of passengers in a given flight
- Classification
- Predicting whether a creditor will default
- Assessing if an image is of a cat or a dog
- Predicting whether the price of stock is going to go up or down
Classes of tasks $T$
- Regression
- the output space is continuous and ordered
- Classification
- the output space is discrete and potentially unordered
Performance $P$
The choice of $P$ is typically based on two main considerations:
- it must be pertinent to the task at hand
- it is common to define it through a loss function, whose properties are essential for optimization problems and for probabilistic learning guarantees
Performance $P$
Back to our previous examples, what are good performance measures for…
- predicting the number of bikes rented in a day
- predicting the future stock price
- determining whether a picture is of a cat or a dog
- learning to play checkers
- generating fake images of people
Loss functions
How do we decide on the best estimate $\hat{f}(\cdot)$ of the true relationship?
Loss functions
Our goal is to choose $\hat{f}(\cdot)$ such that, on average, our predictions are close to the true values… but what does this mean precisely?
Consider $y^{(i)} - \hat{f}(\mathbf{x}^{(i)}) = y^{(i)} - \hat{y}^{(i)}$, our prediction error, i.e., the difference between the true value and our predicted value. If this quantity is positive, we are underpredicting. If it is negative, we are overpredicting.
Intuitively, the closer the prediction error is to zero, the better our prediction. Let us formalize this…
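The sign convention above can be made concrete with two toy predictions (values are illustrative, not from any real dataset):

```python
# Sketch of the prediction error y - y_hat: a positive error means we
# are underpredicting, a negative error means we are overpredicting.
y_true = 250       # actual value, e.g., bikes rented (illustrative)
y_hat_low = 200    # a prediction that is too low
y_hat_high = 300   # a prediction that is too high

error_low = y_true - y_hat_low    # positive: underpredicting
error_high = y_true - y_hat_high  # negative: overpredicting
print(error_low, error_high)  # 50 -50
```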
Loss functions
- We denote our loss function as $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$
- Intuitively, the loss function takes two inputs, the true value and the predicted value, and outputs a number representing how different the two values are.
- For now, we assume our loss to be nonnegative, i.e., $\ell(y^{(i)}, \hat{y}^{(i)}) \geq 0$.
- We will also start by designing our loss function to be zero if and only if its two inputs are equal, i.e., $$\ell(y^{(i)}, \hat{y}^{(i)}) = 0 \iff \hat{y}^{(i)} = y^{(i)}$$
Loss functions
There are many different loss functions, each with advantages and disadvantages depending on the task, e.g.,
- Absolute loss: $\ell_\text{abs}(y, \hat{y}) = \vert\hat{y}-y\vert$
- Squared loss: $\ell_\text{sq}(y, \hat{y}) = \left(\hat{y} -y\right)^2$
- 0-1 loss: $\ell_\text{0-1}(y, \hat{y}) = \begin{cases}0 \text{ if } \hat{y} = y \\ 1 \text{ otherwise.}\end{cases}$
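The three losses above can be written out directly; a quick sketch (real libraries such as scikit-learn provide vectorized versions):

```python
# The three loss functions from the slide, as plain Python functions.

def absolute_loss(y, y_hat):
    """Absolute loss |y_hat - y|: penalizes errors linearly."""
    return abs(y_hat - y)

def squared_loss(y, y_hat):
    """Squared loss (y_hat - y)^2: penalizes large errors more heavily."""
    return (y_hat - y) ** 2

def zero_one_loss(y, y_hat):
    """0-1 loss: 0 if the prediction is exactly right, 1 otherwise."""
    return 0 if y_hat == y else 1

print(absolute_loss(250, 200))  # 50
print(squared_loss(250, 200))   # 2500
print(zero_one_loss("cat", "dog"))  # 1
```

Note how each satisfies the properties above: nonnegative, and zero exactly when $\hat{y} = y$.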
Empirical Risk Minimization
- Equipped with our loss function $\ell$, let us now reconsider the problem of choosing the best predictor $\hat{f}(\cdot)$.
- Suppose we have a set of predictors from which we can choose. We call this set the hypothesis class and denote it by $\mathcal{H}$.
- For instance, in the crop yield example, $\mathcal{H} = \{f_1, f_2, f_3\}$.
Empirical Risk Minimization
We want to find a predictor $\hat{f}(\cdot)$ which minimizes the so-called expected loss or population risk: $$L(f) := \mathbb{E}[\ell(y, f(\mathbf{x}))]$$
… unfortunately, this quantity is unobservable, so we cannot find its minimizer!
Empirical Risk Minimization
While we cannot compute the population risk, we can compute the so-called empirical risk or empirical loss, i.e., the average loss over the data set: $$\hat{L}(f) := \frac{1}{n}\sum_{i=1}^n \ell(y^{(i)}, f(\mathbf{x}^{(i)})).$$
Empirical risk minimization is the fundamental principle of choosing our predictor $\hat{f}$ such that $\hat{L}$ is minimized, i.e., $$\hat{f} = \underset{f \in \mathcal{H}}{\arg\min} \ \hat{L}(f) = \underset{f\in\mathcal{H}}{\arg\min} \ \frac{1}{n}\sum_{i=1}^n \ell(y^{(i)}, f(\mathbf{x}^{(i)}))$$
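The principle can be sketched end to end with a toy hypothesis class $\mathcal{H} = \{f_1, f_2, f_3\}$ and squared loss. The candidate predictors and the dataset below are hypothetical, chosen only to illustrate the $\arg\min$:

```python
# Empirical risk minimization over a toy hypothesis class H = {f1, f2, f3},
# using squared loss. Dataset and predictors are illustrative.

def squared_loss(y, y_hat):
    return (y_hat - y) ** 2

# Toy dataset of (x, y) pairs: x = temperature, y = bikes rented.
data = [(10.0, 110), (20.0, 210), (30.0, 310)]

# A small hypothesis class of candidate predictors.
hypotheses = {
    "f1": lambda x: 5 * x,        # systematically underpredicts
    "f2": lambda x: 10 * x + 10,  # matches the data exactly
    "f3": lambda x: 20 * x,       # systematically overpredicts
}

def empirical_risk(f):
    """Average loss of predictor f over the dataset (L-hat in the slides)."""
    return sum(squared_loss(y, f(x)) for x, y in data) / len(data)

# ERM: pick the predictor in H with the smallest empirical risk.
best_name = min(hypotheses, key=lambda name: empirical_risk(hypotheses[name]))
print(best_name)  # f2
```

With more realistic hypothesis classes (e.g., all linear functions), the $\arg\min$ is found by optimization rather than enumeration, but the principle is the same.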