Statistical Machine Learning Using Jupyter Notebook: Problem

For this HW, we will pretend that we only have 1000 images of the MNIST dataset.

We will begin with what we had in our videos:

#commonly used imports
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

#get data from sklearn
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

#get the attributes and labels
X, y = mnist["data"], mnist["target"]

#convert labels from characters to numbers
y = y.astype(np.uint8)

#split into training and test sets
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Suppose we decide on a Decision Tree Classifier (covered later in Chapter 6). There are two hyperparameters we will fine-tune for now: max_leaf_nodes and min_samples_split. Let’s fine-tune these using the following code.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

#setup the decision tree
dt_clf = DecisionTreeClassifier(random_state=30)

#the hyperparameters to search through
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [3, 4, 5]}

#initialize the GridSearch with 3-fold cross validation
grid_search_cv = GridSearchCV(dt_clf, params, verbose=1, cv=3)

#do the search (but only on the first 1000 images)
grid_search_cv.fit(X_train[:1000], y_train[:1000])

What are the best hyperparameters?
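One way to read them off programmatically (best_params_ and best_estimator_ are standard attributes of a fitted GridSearchCV object):

#the winning hyperparameter combination found by the search
print(grid_search_cv.best_params_)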

Now, let’s fit the best estimator to the training data (again pretending we only have the first 1000 images).

dt_clf = grid_search_cv.best_estimator_
dt_clf.fit(X_train[:1000], y_train[:1000])

Let’s now obtain cross-validation accuracy scores with this fine-tuned model.

from sklearn.model_selection import cross_val_score
cross_val_score(dt_clf, X_train[:1000], y_train[:1000], cv=3, scoring="accuracy")

How accurate are your three folds?
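If you also want a single summary number, you can capture the fold scores and average them (a small optional sketch; the variable name scores is just illustrative):

scores = cross_val_score(dt_clf, X_train[:1000], y_train[:1000], cv=3, scoring="accuracy")
#mean accuracy across the three folds
print(scores, scores.mean())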

If we only had 1000 images and we wanted more to train on, we could do what is known as data augmentation. The augmentation we will do here is to shift the images we do have: for each of the 1000 images, we shift one pixel up, left, right, and down, giving four additional images per original to add to our training set. Data augmentation such as this is a very useful technique when there are not enough training instances. If you run the following code, you will see an example of the first image shifted.

from scipy.ndimage import shift

def shift_image(image, dx, dy):
    #reshape the flat 784-pixel vector into a 28x28 image
    image = image.reshape((28, 28))
    #shift by dy rows and dx columns, filling the exposed pixels with 0
    shifted_image = shift(image, [dy, dx], cval=0, mode="constant")
    return shifted_image.reshape([-1])

#demo on the first training image; shift 5 pixels here only so the effect is easy to see
image = X_train[0]
shifted_image_down = shift_image(image, 0, 5)
shifted_image_left = shift_image(image, -5, 0)

plt.figure(figsize=(12,3))
plt.subplot(131)
plt.title("Original", fontsize=14)
plt.imshow(image.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(132)
plt.title("Shifted down", fontsize=14)
plt.imshow(shifted_image_down.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(133)
plt.title("Shifted left", fontsize=14)
plt.imshow(shifted_image_left.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.show()

To shift all of our images and add them to a training data set, we can use the following code:

#start with copies of the original 1000 images and labels
X_train_augmented = [image for image in X_train[:1000]]
y_train_augmented = [label for label in y_train[:1000]]

#append a copy of every image shifted one pixel in each direction
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(X_train[:1000], y_train[:1000]):
        X_train_augmented.append(shift_image(image, dx, dy))
        y_train_augmented.append(label)

X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)

You now have 5000 images in X_train_augmented and y_train_augmented: the 1000 original images plus four shifted copies of each (up, down, left, and right).
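A quick optional sanity check on the shapes (each of the 1000 originals contributes itself plus four shifted copies):

#expected output: (5000, 784) (5000,)
print(X_train_augmented.shape, y_train_augmented.shape)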

Let’s now fine-tune and fit a decision tree classifier using this augmented training data.

dt_clf = DecisionTreeClassifier(random_state=30)
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [3, 4, 5]}
grid_search_cv = GridSearchCV(dt_clf, params, verbose=1, cv=3)

#This will take around 10 minutes to run
grid_search_cv.fit(X_train_augmented, y_train_augmented)

grid_search_cv.best_estimator_

Now, what are the best hyperparameters?
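As before, you can read the winning combination off the fitted search object. Note that because GridSearchCV refits by default (refit=True), best_estimator_ has already been retrained on the full augmented set:

#the best hyperparameters found on the augmented data
print(grid_search_cv.best_params_)
dt_clf = grid_search_cv.best_estimator_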

Finally, obtain cross-validation accuracy scores with this model trained on the augmented data.
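Mirroring the earlier call, one sketch of this step (assuming the variable names used above):

cross_val_score(dt_clf, X_train_augmented, y_train_augmented, cv=3, scoring="accuracy")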

What are the accuracy scores now?

Did the augmented data help?

Think of one downside to using augmented data and comment.

  • Submit your code in a Jupyter notebook. Include all code: the code examples above and what you write.
  • Put your answers to the questions above into markdown cells.
  • Use a single hashmark (#) heading to label the problem.