Thursday, March 27, 2014

Visualizing principal components with R and Sochi Olympic Athletes

Principal Components Analysis (PCA) is used as a dimensionality reduction method. Here we explain PCA step by step using data about Sochi Olympic curlers.

It is hard to visualize a high-dimensional space. When I took linear algebra, the book and teachers spoke about it as if it were easy to visualize a hyperspace, but later when I took the Coursera course Neural Networks for Machine Learning, Geoffrey Hinton gave the wise advice, "To deal with a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it." In other words, people cannot visualize a high-dimensional space, so we use a simpler problem, two dimensions of Olympic athlete data, to explain PCA.

First, we have one dimensional data where the only dimension is the curler's height.

Next, we add a second dimension: the curler's weight. Notice there is a strong correlation between height and weight. Because of this redundancy, two dimensions are not necessary to represent most of the information.
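
As a quick check, the strength of that correlation can be computed directly. This snippet assumes the ath data frame built in the full code listing below:

# Pearson correlation between the curlers' heights and weights;
# a positive value reflects the relationship visible in the scatterplot
cor(ath$height, ath$weight)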

By the way, if you look carefully at the first two images, notice the horizontal placement of the curlers is identical: adding the second axis moves the curlers only vertically.

After performing PCA, there are two principal components. Because we want to simplify two dimensions into one dimension, we ignore the second principal component and plot the data onto the first component as red squares. The black lines join each original point (green) to its projection (red) onto a one-dimensional line.
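
For the curious, here is a minimal sketch of the arithmetic behind that projection, assuming the x matrix and pcX object from the code listing below have been created (v1, centered, scores, and projected are illustrative names):

# Direction of the first principal component (first column of the rotation matrix)
v1 <- pcX$rotation[, 1]
# Center the data and take the dot product with v1: each curler's 1-D score
centered <- scale(x, center = pcX$center, scale = FALSE)
scores <- centered %*% v1
# Map the scores back into the original 2-D space, then undo the centering
projected <- sweep(scores %*% t(v1), 2, pcX$center, "+")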

The blue line illustrates the first principal component. It is onto this one-dimensional line that the two-dimensional space is projected.

Now we can show the same projections from the previous graph on their own one-dimensional strip chart, which captures most of the variation of the two-dimensional space in one dimension.

So in general, PCA reduces the number of dimensions by projecting high-dimensional data into a lower-dimensional space. With higher-dimensional data, it is often useful to keep more of the principal components. For graphing, two or three principal components are retained. For other purposes, the optimal number of components may be chosen using a scree plot or as the minimum number of components that captures some percentage of the variation, say 90%.
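
For example, with the pcX object fitted in the code listing below, R reports the share of variation captured by each component (pve is an illustrative name):

# Proportion of variance explained, per component, and a scree plot
summary(pcX)
screeplot(pcX, type = "lines")

# Smallest number of components that captures at least 90% of the variance
pve <- pcX$sdev^2 / sum(pcX$sdev^2)
which(cumsum(pve) >= 0.90)[1]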

Here is the R code.


# Read data from CSV
# Download from http://www.danasilver.org/static/assets/sochi-2014-athletes/athletes.csv
# See below for a faster option.
athletes <- read.csv('athletes.csv')

# Subset data
ath <- athletes[athletes$sport=='Curling',c('height','weight')]
ath <- ath[complete.cases(ath),]

# ALTERNATIVELY, instead of downloading, use this embedded copy of the subset
ath <- structure(list(height = c(1.73, 1.78, 1.7, 1.73, 1.71, 1.93, 
1.7, 1.69, 1.84, 1.75, 1.83, 1.8, 1.8, 1.64), weight = c(66L, 
84L, 74L, 66L, 73L, 80L, 58L, 60L, 88L, 85L, 80L, 71L, 85L, 69L
)), .Names = c("height", "weight"), row.names = c(536L, 624L, 
640L, 820L, 930L, 949L, 1191L, 1632L, 1818L, 2349L, 2583L, 2609L, 
2641L, 2696L), class = "data.frame")

# Plot 1 Dimension (just height)
png('pca1-stripchart.png')
stripchart(ath$height, col="green", pch=19, cex=2, 
 xlab="Height (m)", 
 main="Curlers at Sochi 2014 Winter Olympics")
dev.off()

# Plot 2 Dimensions
x <- as.matrix(ath)
plot2d <- function(col = 3)
{
  plot(x, col = col, pch = 19, cex = 2,
    xlab = "Height (m)",
    ylab = "Weight (kg)",
    main = "Curlers at Sochi 2014 Winter Olympics")
}
png('pca2-scatterplot.png')
plot2d()
dev.off()

# Perform PCA (center the data, but do not scale it)
pcX <- prcomp(x, retx = TRUE, scale = FALSE, center = TRUE)

# Project each point onto the first principal component: map the 1-D scores
# back into the original 2-D space along the first rotation vector (the first
# COLUMN of the rotation matrix), then undo the centering
transformed <- pcX$x[, 1] %*% t(pcX$rotation[, 1])
transformed <- scale(transformed, center = -pcX$center, scale = FALSE)

# Plot PCA projection
plot_pca <- function()
{
  plot2d()
  points(transformed, col = 2, pch = 15, cex = 2)  # projected points (red squares)
  segments(x[, 1], x[, 2], transformed[, 1], transformed[, 2])  # join originals to projections
}
png('pca3-pca-projection.png')
plot_pca()
dev.off()

# Draw first principal component over scatterplot
png('pca4-first-component-on-scatterplot.png')
plot_pca()
# The projected points are collinear, so a straight-line fit through them draws the component
lm.fit <- lm(transformed[, 2] ~ transformed[, 1])
abline(lm.fit, col = "blue", lwd = 1.5)
dev.off()

# Plot first principal component by itself
png('pca5-first-component-stripchart.png')
stripchart(pcX$x[,1], col="red", cex=2, pch=15, 
 xlab="First principal component")
dev.off()

This was tested on R 3.0.2 (64-bit). Thank you to Dana Silver for the Sochi athlete data and to cbeleites for explaining how to plot PCA projections with line segments.

1 comment:

  1. Hi, Andrew.

    Thanks for your post. It is a very clear explanation.

    While testing your code, a question came up:

    I built a version of the model without centering:

    pcY <- prcomp(x, retx = TRUE, scale = FALSE, center = FALSE)

    Then when I compare it with the "centered" version, I see two differences: 1) the rotation values are not exactly the same, and 2) the signs in PC2 are flipped:

    pcX (your model)
    Standard deviations:
    [1] 9.70482418 0.05875247

    Rotation:
                    PC1          PC2
    height -0.004935072 -0.999987822
    weight -0.999987822  0.004935072


    pcY (version not centered)
    Standard deviations:
    [1] 77.6461087 0.1899913

    Rotation:
                   PC1         PC2
    height -0.02340583  0.99972605
    weight -0.99972605 -0.02340583


    Now, if I calculate the coordinates to draw the first principal component, I do not get the same results. The line follows another path.

    I thought the first principal component captured as much variance as possible, independent of whether or not the data are centered. Could you tell me the error in my reasoning?

    Thank you very much in advance and best regards.

    P.S. Sorry for my English.

