Introduction to Machine Learning
Due: Monday, September 09, 2024 at 11:00 PM
Lab attendance check: type in your section passcode to get attendance credit (normally within the first 15 minutes of your scheduled section; today only, the passcode is accepted anytime in-session).
Instructions
In 6.390 a large part of the learning happens by discussing lab questions with partners. Please complete this group self-partnering question, then begin the lab.
One person (and only one) should create a group. Everyone else should join by entering the group name (in the form groupname_0000). To join another group or leave your current group, reload the page.
1) Lab: Random Hypotheses
In this lab we will experiment with a simple algorithm for finding a line that fits a data set \mathcal{D}_n=\left\{\left(x^{(1)}, y^{(1)}\right), \ldots,\left(x^{(n)}, y^{(n)}\right)\right\} well, where x^{(i)}\in \mathbb{R} and y^{(i)}\in \mathbb{R}. We will generate a lot of random hypotheses within the linear hypothesis class -- i.e., several models with linear parametrization but each with randomly generated parameters. More specifically, in this case, a lot of lines, each with randomly generated slope and intercept.
We will then see which of these randomly generated models has the smallest error on this data, and return that one as our answer. We don't recommend this method in practice, but it gets us started and yields some useful insights.
Note that this lab builds on the lecture notes Chapter 1 - Intro to ML and Chapter 2 - Regression for background and terminology, and the algorithm itself is a version of the random-regression algorithm from the notes Section 2.4. You'll be expected to read those notes before you start on Homework 1. In fact, Homework 1 will ask you to implement code that we're only describing or using here (you don't get to see all that code in this lab) -- so this lab will help get you ready for the homework!
It is good practice for you and your team to write down short answers to the lab questions as you work through them so that you are prepared for a check-off conversation at the end.
Here is the top-level code for the algorithm random_regress(X, Y, k), defined below:
- X is a d \times n matrix with n training input points in a d-dimensional space.
- Y is a 1 \times n vector with n training output values.
- k is the number of random hypotheses to try.
All of this is normal Python and numpy. This function uses lin_reg_err, which is one of those functions you will write in Homework 1 (and where you'll also get more practice with numpy).
The lin_reg_err function takes in the training data and k hypotheses (in the ths and th0s arrays) and returns a vector of k values: the mean squared error (MSE) of each of the hypotheses on the data.
In particular, consider one hypothesis (out of the k possible hypotheses). Suppose the parameters of this hypothesis are \theta and \theta_0; then its MSE is defined as

\frac{1}{n} \sum_{i=1}^{n} \left(\theta^T x^{(i)} + \theta_0 - y^{(i)}\right)^2
import numpy as np

def random_regress(X, Y, k):
    d, n = X.shape
    # generate k random hypotheses
    ths = np.random.randn(d, k)
    th0s = np.random.randn(1, k)
    # compute the mean squared error of each hypothesis on the data set
    errors = lin_reg_err(X, Y, ths, th0s.T)
    # find the index of the hypothesis with the lowest error
    i = np.argmin(errors)
    # return the theta and theta0 parameters that define that hypothesis
    theta, theta0 = ths[:, i:i+1], th0s[:, i:i+1]
    return (theta, theta0), errors[i]
You'll get more practice with numpy in the homework, but in the meantime you might find the numpy reference for np.random.randn helpful.
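To connect the code to the MSE formula above, here is a minimal sketch of computing the MSE of a single hypothesis (this is not the vectorized lin_reg_err you'll write in Homework 1, and the name mse_single is ours, not part of the course code):

```python
import numpy as np

def mse_single(X, Y, th, th0):
    # X: d x n inputs, Y: 1 x n outputs, th: d x 1 slope vector, th0: scalar.
    # Mean squared error of one linear hypothesis on the data set.
    preds = th.T @ X + th0            # 1 x n predicted outputs
    return float(np.mean((preds - Y) ** 2))

# Tiny check: one point x = 1 with y = 3, and the hypothesis y = 2x + 0.
X = np.array([[1.0]])
Y = np.array([[3.0]])
th = np.array([[2.0]])
print(mse_single(X, Y, th, 0.0))      # (2*1 + 0 - 3)^2 = 1.0
```

The homework version does the same computation for all k hypotheses at once, without a Python loop.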
1.1)
Talk through the code for random_regress
with your group, and try to
get a basic understanding of what is going on. Put yourself in the
help queue to ask if you have any questions!
1.2)
We'll start by studying example executions of this algorithm with a data set where d = 1, n = 10 for different values of k (so the horizontal axis is the single dimension of x and the vertical axis is the output y). The hypothesis with minimal mean squared error (MSE), of the k that were tested, is shown in red.
Here is the data set and one random hypothesis.
- Is this a good hypothesis?
- Of all hypotheses within the linear hypothesis class (including those we didn't get to test), which one would achieve the minimal MSE, roughly?
1.3)
Now, here are some executions for different values of k (shown in red is the hypothesis with the lowest MSE, among the k tested).
[Plots of the best-of-k hypothesis (in red) for four values of k]
- What happens as we increase k?
- Comparing the four “best” linear regressors found by the random regression algorithm for the different values of k chosen, which one does your group think is "best of the best"?
- How does it match your initial guess about the best hypothesis?
- Will this method eventually get arbitrarily close to the best solution? What do you think about the efficiency of this method?
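One way to build intuition about the effect of k is a quick numeric experiment. The data set below is made up for illustration (the lab's actual data isn't shown). We draw one pool of random hypotheses and let each k index a prefix of the pool, so the best-of-k error can only go down as k grows:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy d = 1, n = 10 data set from a hidden line y = 2x - 1 plus noise
# (an assumption for illustration; not the lab's actual data set).
X = rng.uniform(0, 1, (1, 10))
Y = 2 * X - 1 + 0.1 * rng.standard_normal((1, 10))

# One pool of 10000 random linear hypotheses; k indexes a prefix of it.
ths = rng.standard_normal((1, 10000))
th0s = rng.standard_normal((1, 10000))
preds = ths.T @ X + th0s.T                  # 10000 x 10 predictions
errs = np.mean((preds - Y) ** 2, axis=1)    # MSE of each hypothesis
for k in [10, 100, 1000, 10000]:
    print(k, errs[:k].min())                # non-increasing in k
```

Since each prefix contains the previous one, the printed best-of-k MSE never increases with k; how quickly it approaches the true minimum is exactly the efficiency question above.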
2) Evaluation
In considering a machine learning algorithm, we are sometimes interested in how different settings of the algorithm affect its performance. In Part 1 above, we gained some informal intuition about the impact of k, for a case when we had n = 10 training points. In this part, we'll more formally evaluate the impact of the training set size n, when k is large.
We have an experimental procedure(n, m, d, k); its Python implementation is not provided here, but it works as follows:
- Generate a random regression problem with d input dimensions, which we'll use as a test for the algorithm. We do this by randomly drawing some hidden parameters \theta and \theta_0 which the learning algorithm will not get to observe.
- Generate a data set (X, Y) with (n+m) data points. The X values are drawn uniformly from the range [0, 1]. The output vector Y contains \theta^T x^{(i)} + \theta_0 + \epsilon^{(i)} for each column vector x^{(i)} in X, where \epsilon^{(i)} is some small random noise.
- Randomly split the data set into two parts: the training set, containing n data points, and the validation set, containing m data points.
- Run the random_regress(X, Y, k) algorithm on the training set to get estimated \theta and \theta_0.
- Compute the MSE of \theta, \theta_0 on the validation set (which the learning algorithm didn't get to see).
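The five steps above can be sketched in numpy. Since the course's implementation isn't shown, the names, the noise scale, and the inlined random-hypothesis search are all our assumptions:

```python
import numpy as np

def procedure(n, m, d, k, noise=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Hidden true parameters the learner never observes.
    th_true = rng.standard_normal((d, 1))
    th0_true = rng.standard_normal()
    # 2) n + m points: inputs uniform in [0, 1], noisy linear outputs.
    X = rng.uniform(0, 1, (d, n + m))
    Y = th_true.T @ X + th0_true + noise * rng.standard_normal((1, n + m))
    # 3) Random split into train (n points) and validation (m points).
    perm = rng.permutation(n + m)
    X_tr, Y_tr = X[:, perm[:n]], Y[:, perm[:n]]
    X_va, Y_va = X[:, perm[n:]], Y[:, perm[n:]]
    # 4) Random-hypothesis regression on the training set (inlined here).
    ths = rng.standard_normal((d, k))
    th0s = rng.standard_normal((1, k))
    errs = np.mean((ths.T @ X_tr + th0s.T - Y_tr) ** 2, axis=1)
    i = np.argmin(errs)
    th, th0 = ths[:, i:i+1], th0s[:, i:i+1]
    # 5) MSE of the chosen hypothesis on the held-out validation set.
    val_mse = float(np.mean((th.T @ X_va + th0 - Y_va) ** 2))
    return float(errs[i]), val_mse
```

Running procedure for several values of n (with m, d, k fixed) produces the kind of training/validation error curves studied below.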
2.1)
Talk through this algorithm with your partner to be sure it makes sense. For instance, what should be the dimensions of \theta and \theta_0?
2.2)
We will study some plots that show the performance of this algorithm and try to explain some observations we make of them.
We fix m = 100, d = 10, and k = 10000, and we run our procedure for different values of the training set size n, with n ranging from 2 to 40.
The blue (solid lower) curve reports the mean squared error on the training set and the red (dashed upper) curve reports it on the validation set.
For each of the following observations, come up with an explanation; you may use the Potential Explanations (below) as a set of possible ideas for answers.
2.2.1)
With small n relative to d, and large k, the training and validation errors are substantially different, with validation error much higher. Why?
2.2.2)
As the amount of training data increases, the training error increases slightly. Why?
2.2.3)
As the amount of training data increases, the validation error goes down a lot. Why?
Potential Explanations
A: As the size of the training data increases, the learned hypothesis is more likely to be close to the true underlying hypothesis.
B: The variation of the errors (with respect to varying k) has to do with the random sampling of hypotheses -- sometimes when we run the learning algorithm (we run it independently once per experiment) it happens to get a good hypothesis, and sometimes it doesn't.
C: Because a small value of k is not guaranteed to generate a good hypothesis.
D: Because the error of a hypothesis on two different, but large, data sets from the same source is likely to be very similar, independent of whether it's a good or bad hypothesis.
E: Because with just a few points in a high-dimensional space, there are lots of different hypotheses that will have low error on the training set. This means we can generally find a hypothesis that goes near them. But when we draw a new set of points, they might be very far away from our hypothesis.
F: Because it is easier to get a small training error when you have a lot of parameters relative to the amount of training data.
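As a sanity check on explanation D, one can evaluate a single fixed hypothesis on two large, independent data sets drawn from the same source and compare the two errors. The setup below is our own made-up example (the data-generating model and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
th_true = rng.standard_normal((d, 1))   # hidden data-generating parameters
th = rng.standard_normal((d, 1))        # one arbitrary (probably bad) hypothesis
th0 = 0.0

def mse_on_fresh_data(n):
    # Draw a fresh data set of n points from the same source,
    # then measure the fixed hypothesis's MSE on it.
    X = rng.uniform(0, 1, (d, n))
    Y = th_true.T @ X + 0.01 * rng.standard_normal((1, n))
    return float(np.mean((th.T @ X + th0 - Y) ** 2))

# Two large, independent data sets give nearly the same error for the
# same hypothesis, whether or not that hypothesis is any good.
print(mse_on_fresh_data(100000), mse_on_fresh_data(100000))
```

With large data sets, each MSE is an average of many independent terms, so the two printed numbers concentrate around the same value; with only a handful of points (as in 2.2.1), they would not.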
Checkoff
Note that only one person per team needs to get into the queue for the checkoff, if you are doing the checkoff during lab time. You do not have to finish the checkoff during lab -- lab checkoffs are typically due the following Monday at 11pm.
Have a check-off conversation with a staff member; this one is mainly for you to get used to the checkoff mechanism we'll use for the rest of the semester. We'd also love to chat a bit about the lab and all the questions above.