Introduction to Machine Learning
Due: Monday, September 09, 2024 at 11:00 PM
Lab attendance check: type in your section passcode to get attendance credit (normally within the first 15 minutes of your scheduled section; today only, the passcode is accepted anytime in-session).
Instructions
In 6.390 a large part of the learning happens by discussing lab questions with partners. Please complete this group self-partnering question, then begin the lab.
One person (and only one) should create a group. Everyone else should join by entering the group name (in the form groupname_0000). To join another group or leave your current group, reload the page.
1) Lab: Random Hypotheses
In this lab we will experiment with a simple algorithm for finding a line that fits a data set \mathcal{D}_n=\left\{\left(x^{(1)}, y^{(1)}\right), \ldots,\left(x^{(n)}, y^{(n)}\right)\right\} well, where x^{(i)}\in \mathbb{R} and y^{(i)}\in \mathbb{R}. We will generate a lot of random hypotheses within the linear hypothesis class -- i.e., several models with linear parametrization but each with randomly generated parameters. More specifically, in this case, a lot of lines, each with randomly generated slope and intercept.
We will then see which of these randomly generated models has the smallest error on this data, and return that one as our answer. We don't recommend this method in practice, but it gets us started and yields some useful insights.
Note that this lab builds on the lecture notes Chapter 1 - Intro to ML and Chapter 2 - Regression for background and terminology, and the algorithm itself is a version of the random-regression algorithm from the notes Section 2.4. You'll be expected to read those notes before you start on Homework 1. In fact, Homework 1 will ask you to implement code that we're only describing or using here (you don't get to see all that code in this lab) -- so this lab will help get you ready for the homework!
It is good practice for you and your team to write down short answers to the lab questions as you work through them so that you are prepared for a check-off conversation at the end.
Here is the top-level code for the algorithm random_regress(X, Y, k), defined below:
- X is a d \times n matrix with n training input points in a d-dimensional space.
- Y is a 1 \times n vector with n training output values.
- k is the number of random hypotheses to try.
All of this is normal Python and numpy. This function uses lin_reg_err, which is one of those functions you will write in Homework 1 (and where you'll also get more practice with numpy).
The lin_reg_err function takes in the training data and k hypotheses (in the ths and th0s arrays) and returns a vector of k values: the mean squared error (MSE) of each of the hypotheses on the data.
In particular, consider one hypothesis (out of the k possible hypotheses). Suppose the parameters of this hypothesis are \theta and \theta_0; then its MSE is defined as

\frac{1}{n} \sum_{i=1}^{n} \left(\theta^T x^{(i)} + \theta_0 - y^{(i)}\right)^2
import numpy as np

def random_regress(X, Y, k):
    d, n = X.shape
    # generate k random hypotheses
    ths = np.random.randn(d, k)
    th0s = np.random.randn(1, k)
    # compute the mean squared error of each hypothesis on the data set
    errors = lin_reg_err(X, Y, ths, th0s.T)
    # find the index of the hypothesis with the lowest error
    i = np.argmin(errors)
    # return the theta and theta0 parameters that define that hypothesis
    theta, theta0 = ths[:, i:i+1], th0s[:, i:i+1]
    return (theta, theta0), errors[i]
You'll get more practice with numpy in the homework, but in the meantime you might find the numpy reference for np.random.randn helpful.
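To connect the code to the MSE formula above, here is a minimal sketch of computing the MSE of a single hypothesis (this is not the vectorized lin_reg_err you'll write in Homework 1, and the name mse_single is ours, not part of the course code):

```python
import numpy as np

def mse_single(X, Y, th, th0):
    # X: d x n inputs, Y: 1 x n outputs, th: d x 1 slope vector, th0: scalar.
    # Mean squared error of one linear hypothesis on the data set.
    preds = th.T @ X + th0            # 1 x n predicted outputs
    return float(np.mean((preds - Y) ** 2))

# Tiny check: one point x = 1 with y = 3, and the hypothesis y = 2x + 0.
X = np.array([[1.0]])
Y = np.array([[3.0]])
th = np.array([[2.0]])
print(mse_single(X, Y, th, 0.0))      # (2*1 + 0 - 3)^2 = 1.0
```

The homework version does the same computation for all k hypotheses at once, without a Python loop.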
1.1)
Talk through the code for random_regress
with your group, and try to
get a basic understanding of what is going on. Put yourself in the
help queue to ask if you have any questions!
1.2)
We'll start by studying example executions of this algorithm with a data set where d = 1, n = 10 for different values of k (so the horizontal axis is the single dimension of x and the vertical axis is the output y). The hypothesis with minimal mean squared error (MSE), of the k that were tested, is shown in red.
Here is the data set and one random hypothesis.
- Is this a good hypothesis?
- Of all hypotheses within the linear hypothesis class (including those we didn't get to test), which one would achieve the minimal MSE, roughly?
1.3)
Now, here are some executions for different values of k (shown in red is the hypothesis with the lowest MSE, among the k tested).
[Plots of the best-of-k hypothesis (in red) for four values of k]
- What happens as we increase k?
- Comparing the four “best” linear regressors found by the random regression algorithm for the different values of k chosen, which one does your group think is "best of the best"?
- How does it match your initial guess about the best hypothesis?
- Will this method eventually get arbitrarily close to the best solution? What do you think about the efficiency of this method?
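One way to build intuition about the effect of k is a quick numeric experiment. The data set below is made up for illustration (the lab's actual data isn't shown). We draw one pool of random hypotheses and let each k index a prefix of the pool, so the best-of-k error can only go down as k grows:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy d = 1, n = 10 data set from a hidden line y = 2x - 1 plus noise
# (an assumption for illustration; not the lab's actual data set).
X = rng.uniform(0, 1, (1, 10))
Y = 2 * X - 1 + 0.1 * rng.standard_normal((1, 10))

# One pool of 10000 random linear hypotheses; k indexes a prefix of it.
ths = rng.standard_normal((1, 10000))
th0s = rng.standard_normal((1, 10000))
preds = ths.T @ X + th0s.T                  # 10000 x 10 predictions
errs = np.mean((preds - Y) ** 2, axis=1)    # MSE of each hypothesis
for k in [10, 100, 1000, 10000]:
    print(k, errs[:k].min())                # non-increasing in k
```

Since each prefix contains the previous one, the printed best-of-k MSE never increases with k; how quickly it approaches the true minimum is exactly the efficiency question above.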
2) Evaluation
In considering a machine learning algorithm, we are sometimes interested in how different settings of the algorithm affect its performance. In Part 1 above, we gained some informal intuition about the impact of k, for a case when we had n = 10 training points. In this part, we'll more formally evaluate the impact of the training set size n, when k is large.
We have an experimental procedure(n, m, d, k); its Python implementation is not provided here, but it works as follows:
- Generate a random regression problem with d input dimensions, which we'll use as a test for the algorithm. We do this by randomly drawing some hidden parameters \theta and \theta_0 which the learning algorithm will not get to observe.
- Generate a data set (X, Y) with (n+m) data points. The X values are drawn uniformly from the range [0, 1]. The output vector Y contains \theta^T x^{(i)} + \theta_0 + \epsilon^{(i)} for each column vector x^{(i)} in X, where \epsilon^{(i)} is some small random noise.
- Randomly split the data set into two parts: the training set, containing n data points, and the validation set, containing m data points.
- Run the random_regress(X, Y, k) algorithm on the training set to get estimated \theta and \theta_0.
- Compute the MSE of \theta, \theta_0 on the validation set (which the learning algorithm didn't get to see).
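The five steps above can be sketched in numpy. Since the course's implementation isn't shown, the names, the noise scale, and the inlined random-hypothesis search are all our assumptions:

```python
import numpy as np

def procedure(n, m, d, k, noise=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Hidden true parameters the learner never observes.
    th_true = rng.standard_normal((d, 1))
    th0_true = rng.standard_normal()
    # 2) n + m points: inputs uniform in [0, 1], noisy linear outputs.
    X = rng.uniform(0, 1, (d, n + m))
    Y = th_true.T @ X + th0_true + noise * rng.standard_normal((1, n + m))
    # 3) Random split into train (n points) and validation (m points).
    perm = rng.permutation(n + m)
    X_tr, Y_tr = X[:, perm[:n]], Y[:, perm[:n]]
    X_va, Y_va = X[:, perm[n:]], Y[:, perm[n:]]
    # 4) Random-hypothesis regression on the training set (inlined here).
    ths = rng.standard_normal((d, k))
    th0s = rng.standard_normal((1, k))
    errs = np.mean((ths.T @ X_tr + th0s.T - Y_tr) ** 2, axis=1)
    i = np.argmin(errs)
    th, th0 = ths[:, i:i+1], th0s[:, i:i+1]
    # 5) MSE of the chosen hypothesis on the held-out validation set.
    val_mse = float(np.mean((th.T @ X_va + th0 - Y_va) ** 2))
    return float(errs[i]), val_mse
```

Running procedure for several values of n (with m, d, k fixed) produces the kind of training/validation error curves studied below.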
2.1)
Talk through this algorithm with your partner to be sure it makes sense. For instance, what should be the dimensions of \theta and \theta_0?
2.2)
We will study some plots that show the performance of this algorithm and try to explain some observations we make of them.
We fix m = 100, d = 10, and k = 10000, and we run our procedure for different values of the training set size n, with n ranging from 2 to 40.
The blue (solid lower) curve reports the mean squared error on the training set and the red (dashed upper) curve reports it on the validation set.
For each of the following observations, come up with an explanation; you may use the Potential Explanations (below) as a set of possible ideas for answers.
2.2.1)
With small n relative to d, and large k, the training and validation errors are substantially different, with validation error much higher. Why?
2.2.2)
As the amount of training data increases, the training error increases slightly. Why?
2.2.3)
As the amount of training data increases, the validation error goes down a lot. Why?
Potential Explanations
A: As the size of the training data increases, the learned hypothesis is more likely to be close to the true underlying hypothesis.
B: The variation of the errors (with respect to varying k) has to do with the random sampling of hypotheses -- sometimes when we run the learning algorithm (we run it independently once per experiment) it happens to get a good hypothesis, and sometimes it doesn't.
C: Because a small value of k is not guaranteed to generate a good hypothesis.
D: Because the error of a hypothesis on two different, but large, data sets from the same source is likely to be very similar, independent of whether it's a good or bad hypothesis.
E: Because with just a few points in a high-dimensional space, there are lots of different hypotheses that will have low error on the training set. This means we can generally find a hypothesis that goes near them. But when we draw a new set of points, they might be very far away from our hypothesis.
F: Because it is easier to get a small training error when you have a lot of parameters relative to the amount of training data.
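As a sanity check on explanation D, one can evaluate a single fixed hypothesis on two large, independent data sets drawn from the same source and compare the two errors. The setup below is our own made-up example (the data-generating model and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
th_true = rng.standard_normal((d, 1))   # hidden data-generating parameters
th = rng.standard_normal((d, 1))        # one arbitrary (probably bad) hypothesis
th0 = 0.0

def mse_on_fresh_data(n):
    # Draw a fresh data set of n points from the same source,
    # then measure the fixed hypothesis's MSE on it.
    X = rng.uniform(0, 1, (d, n))
    Y = th_true.T @ X + 0.01 * rng.standard_normal((1, n))
    return float(np.mean((th.T @ X + th0 - Y) ** 2))

# Two large, independent data sets give nearly the same error for the
# same hypothesis, whether or not that hypothesis is any good.
print(mse_on_fresh_data(100000), mse_on_fresh_data(100000))
```

With large data sets, each MSE is an average of many independent terms, so the two printed numbers concentrate around the same value; with only a handful of points (as in 2.2.1), they would not.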
Checkoff
Note that only one person per team needs to get into the queue for the checkoff, if you are doing the checkoff during lab time. You do not have to finish the checkoff during lab -- lab checkoffs are typically due the following Monday at 11pm.
Have a check-off conversation with a staff member; this one is mainly for you to get used to the checkoff mechanism we'll use for the rest of the semester. We'd also love to chat a bit about the lab and all the questions above.