Tutorial 4: Nonlinear Dimensionality Reduction
Week 1, Day 4: Dimensionality Reduction
By Neuromatch Academy
Content creators: Alex Cayco Gajic, John Murray
Content reviewers: Roozbeh Farhoudi, Matt Krause, Spiros Chavlis, Richard Gao, Michael Waskom, Siddharth Suresh, Natalie Schaworonkow, Ella Batty
Tutorial Objectives
Estimated timing of tutorial: 35 minutes
In this notebook we’ll explore how dimensionality reduction can be useful for visualizing and inferring structure in your data. To do this, we will compare PCA with t-SNE, a nonlinear dimensionality reduction method.
Overview:
Visualize MNIST in 2D using PCA.
Visualize MNIST in 2D using t-SNE.
Video 1: PCA Applications
Setup
⚠ Experimental LLM-enhanced tutorial ⚠
This notebook includes Neuromatch’s experimental Chatify 🤖 functionality. The Chatify notebook extension adds support for a large language model-based “coding tutor” to the materials. The tutor provides automatically generated text to help explain any code cell in this notebook.
Note that using Chatify may cause breaking changes and/or provide incorrect or misleading information. If you wish to proceed by installing and enabling the Chatify extension, you should run the next two code blocks (hidden by default). If you do not want to use this experimental version of the Neuromatch materials, please use the stable materials instead.
To use the Chatify helper, insert the %%explain magic command at the start of any code cell and then run it (shift + enter) to access an interface for receiving LLM-based assistance. You can then select different options from the dropdown menus depending on what sort of assistance you want. To disable Chatify and run the code block as usual, simply delete the %%explain command and re-run the cell.
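For example, a cell prepared for Chatify might look like the following (illustrative only; the code beneath the magic can be any cell from this notebook):
%%explain
# Any code you would like explained goes below the magic, for example:
pca_model = PCA(n_components=2)
pca_model.fit(X_all)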
Note that, by default, all of Chatify’s responses are generated locally. This often takes several minutes per response. Once you click the “Submit request” button, just be patient; stuff is happening even if you can’t see it right away!
Thanks for giving Chatify a try! Love it? Hate it? Either way, we’d love to hear from you about your Chatify experience! Please consider filling out our brief survey to provide feedback and help us make Chatify more awesome!
Run the next two cells to install and configure Chatify…
%pip install -q davos
import davos
davos.config.suppress_stdout = True
Note: you may need to restart the kernel to use updated packages.
smuggle chatify # pip: git+https://github.com/ContextLab/chatify.git
%load_ext chatify
# Imports
import numpy as np
import matplotlib.pyplot as plt
Figure Settings
#@title Figure Settings
import ipywidgets as widgets # interactive display
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle")
Plotting Functions
# @title Plotting Functions
def visualize_components(component1, component2, labels, show=True):
"""
Plots a 2D representation of the data for visualization with categories
labelled as different colors.
Args:
component1 (numpy array of floats) : Vector of component 1 scores
component2 (numpy array of floats) : Vector of component 2 scores
labels (numpy array of floats) : Vector corresponding to categories of
samples
Returns:
Nothing.
"""
plt.figure()
cmap = plt.cm.get_cmap('tab10')
plt.scatter(x=component1, y=component2, c=labels, cmap=cmap)
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5)
if show:
plt.show()
Section 1: Visualize MNIST in 2D using PCA
In this exercise, we’ll visualize the first few components of the MNIST dataset to look for evidence of structure in the data. But in this tutorial, we will also be interested in the label of each image (i.e., which numeral it is from 0 to 9). Start by running the following cell to reload the MNIST dataset (this takes a few seconds).
from sklearn.datasets import fetch_openml
# Get images
mnist = fetch_openml(name='mnist_784', as_frame=False)
X_all = mnist.data
# Get labels
labels_all = np.array([int(k) for k in mnist.target])
Note: We saved the complete dataset as X_all and the labels as labels_all.
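As a quick, optional sanity check, you can inspect what was loaded; with mnist_784 this should show 70,000 samples, each a flattened 28 × 28 image (784 pixels), with one digit label (0–9) per sample:
# Optional sanity check on the loaded data
print(X_all.shape)            # expected: (70000, 784)
print(labels_all.shape)       # expected: (70000,)
print(np.unique(labels_all))  # expected: [0 1 2 3 4 5 6 7 8 9]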
To perform PCA, we will now use the implementation in sklearn. Run the following cell to set the parameters of PCA: we will only keep the top 2 components because we will be visualizing the data in 2D.
from sklearn.decomposition import PCA
# Initializes PCA
pca_model = PCA(n_components=2)
# Performs PCA
pca_model.fit(X_all)
PCA(n_components=2)
Coding Exercise 1: Visualization of MNIST in 2D using PCA
Fill in the code below to perform PCA and visualize the top two components. For better visualization, take only the first 2,000 samples of the data (this will also make t-SNE much faster in the following section of the tutorial, so don’t skip this step!).
Suggestions:
Truncate the data matrix at 2,000 samples. You will also need to truncate the array of labels.
Perform PCA on the truncated data.
Use the function visualize_components to plot the labeled data.
help(visualize_components)
help(pca_model.transform)
Help on function visualize_components in module __main__:
visualize_components(component1, component2, labels, show=True)
Plots a 2D representation of the data for visualization with categories
labelled as different colors.
Args:
component1 (numpy array of floats) : Vector of component 1 scores
component2 (numpy array of floats) : Vector of component 2 scores
labels (numpy array of floats) : Vector corresponding to categories of
samples
Returns:
Nothing.
Help on method transform in module sklearn.decomposition._base:
transform(X) method of sklearn.decomposition._pca.PCA instance
Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted
from a training set.
Parameters
----------
X : array-like of shape (n_samples, n_features)
New data, where `n_samples` is the number of samples
and `n_features` is the number of features.
Returns
-------
X_new : array-like of shape (n_samples, n_components)
Projection of X in the first principal components, where `n_samples`
is the number of samples and `n_components` is the number of the components.
#################################################
## TODO for students: take only 2,000 samples and perform PCA
# Comment once you've completed the code
raise NotImplementedError("Student exercise: perform PCA")
#################################################
# Take only the first 2000 samples with the corresponding labels
X, labels = ...
# Perform PCA
scores = pca_model.transform(X)
# Plot the data and reconstruction
visualize_components(...)
Example output:
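If you get stuck, one possible completion of the exercise is sketched below (it follows the variable names in the scaffold above and is meant as a reference, not the only valid answer):
# Take only the first 2000 samples with the corresponding labels
X, labels = X_all[:2000, :], labels_all[:2000]
# Project the truncated data onto the top 2 components fit above
scores = pca_model.transform(X)
# Plot component 1 vs. component 2, colored by digit label
visualize_components(scores[:, 0], scores[:, 1], labels)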
Think! 1: PCA Visualization
What do you see? Are different samples corresponding to the same numeral clustered together? Is there much overlap?
Do some pairs of numerals appear to be more distinguishable than others?
Section 2: Visualize MNIST in 2D using t-SNE
Estimated timing to here from start of tutorial: 15 min
Video 2: Nonlinear Methods
Next we will analyze the same data using t-SNE, a nonlinear dimensionality reduction method that is useful for visualizing high dimensional data in 2D or 3D. Run the cell below to get started.
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, perplexity=30, random_state=2020)
Coding Exercise 2.1: Apply t-SNE on MNIST
First, we’ll run t-SNE on the data to explore whether we can see more structure. The cell above defined the parameters that we will use to find our embedding (i.e., the low-dimensional representation of the data) and stored them in tsne_model. To run t-SNE on our data, use the method tsne_model.fit_transform.
Suggestions:
Run t-SNE using the method tsne_model.fit_transform.
Plot the resulting data using visualize_components.
help(tsne_model.fit_transform)
Help on method fit_transform in module sklearn.manifold._t_sne:
fit_transform(X, y=None) method of sklearn.manifold._t_sne.TSNE instance
Fit X into an embedded space and return that transformed output.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
If the metric is 'precomputed' X must be a square distance
matrix. Otherwise it contains a sample per row. If the method
is 'exact', X may be a sparse matrix of type 'csr', 'csc'
or 'coo'. If the method is 'barnes_hut' and the metric is
'precomputed', X may be a precomputed sparse graph.
y : None
Ignored.
Returns
-------
X_new : ndarray of shape (n_samples, n_components)
Embedding of the training data in low-dimensional space.
#################################################
## TODO for students
# Comment once you've completed the code
raise NotImplementedError("Student exercise: perform t-SNE")
#################################################
# Perform t-SNE
embed = ...
# Visualize the data
visualize_components(..., ..., labels)
Example output:
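Again, a possible completion is sketched below for reference (it uses the tsne_model defined above and the truncated X and labels from Section 1):
# Perform t-SNE: fit the model to X and return the 2D embedding
embed = tsne_model.fit_transform(X)
# Visualize the embedding, colored by digit label
visualize_components(embed[:, 0], embed[:, 1], labels)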
Coding Exercise 2.2: Run t-SNE with different perplexities
Unlike PCA, t-SNE has a free parameter (the perplexity) that roughly determines how global vs. local information is weighted. Here we’ll take a look at how the perplexity affects our interpretation of the results.
Steps:
Rerun t-SNE (don’t forget to re-initialize it using the function TSNE as above) with a perplexity of 50, 5, and 2.
def explore_perplexity(values, X, labels):
"""
Plots a 2D representation of the data for visualization with categories
labeled as different colors using different perplexities.
Args:
values (list of floats) : list with perplexities to be visualized
X (np.ndarray of floats) : matrix with the dataset
labels (np.ndarray of int) : array with the labels
Returns:
Nothing.
"""
for perp in values:
#################################################
## TODO for students: redefine the t-SNE "model" with the current perplexity,
## then perform t-SNE on the data and plot the results for
## perplexity = 50, 5, and 2 (set random_state to 2020)
# Comment these lines when you complete the function
raise NotImplementedError("Student Exercise! Explore t-SNE with different perplexity")
#################################################
# Perform t-SNE
tsne_model = ...
embed = tsne_model.fit_transform(X)
visualize_components(embed[:, 0], embed[:, 1], labels, show=False)
plt.title(f"perplexity: {perp}")
# Visualize
values = [50, 5, 2]
explore_perplexity(values, X, labels)
Example output:
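For reference, the missing line inside the loop could be filled in as follows (a sketch; the rest of the loop body stays as in the scaffold above):
# Re-initialize the t-SNE model with the current perplexity value
tsne_model = TSNE(n_components=2, perplexity=perp, random_state=2020)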
Think! 2: t-SNE Visualization
With a perplexity of 50, what changed compared to your previous results? Do you see any clusters with a different structure than before?
What changed in the embedding structure for a perplexity of 5 or 2?
Summary
Estimated timing of tutorial: 35 minutes
We learned the difference between linear and nonlinear dimensionality reduction. While nonlinear methods can be more powerful, they can also be sensitive to noise. In contrast, linear methods are useful for their simplicity and robustness.
We compared PCA and t-SNE for data visualization. Using t-SNE, we could visualize clusters in the data corresponding to different digits. While PCA was able to separate some clusters (e.g., 0 vs 1), it performed poorly overall.
However, the results of t-SNE can change depending on the choice of perplexity. To learn more, we recommend this Distill paper by Wattenberg et al., 2016.