Performing Principal Component Analysis on Flag Images in R

Author

Caleb Hallinan

Published

October 19, 2023

In this project, we will embark on an adventure using Principal Component Analysis on images of various flags of countries/regions. Hopefully by the end you will have a better understanding of PCA and how we can utilize it to look at images in R!

Note: I condensed the R code to “Code” sections throughout for scrolling purposes. Feel free to click on these sections to look at each step, or change to “Show All Code” in the top right of the webpage!

Introduction

Principal Component Analysis (PCA) is a well-established technique for reducing the dimensions, or features, of a dataset while preserving as much of its variation as possible. It is widely used in pattern recognition, signal processing, and machine learning, and it is a useful tool for denoising, visualizing, and classifying large datasets, provided the features are (approximately) linearly related. For example, imagine you have a genomics dataset with 10000+ genes and 5000+ observations. That's a lot of data! However, not all of those genes carry unique information: many of them are correlated with one another and express similar patterns of variation. PCA is a great tool for collapsing those 10000+ dimensions down to a much smaller number, eliminating redundancy and making downstream analysis more tractable.

PCA is heavily based on linear algebra. Essentially, PCA computes the eigenvectors of the covariance matrix of the data and sorts them by their eigenvalues (which correspond to the variance explained by each individual principal component). The principal components (PCs) are then computed as linear combinations of the original variables using those eigenvectors. It sounds complicated, but I promise it's not as hard to understand as you might think. This article was a great resource for me when I was first trying to figure it out. I won't go into much of the algebra behind PCA here, so I encourage you to research it further! The animation above gives a really nice picture of what PCA is doing in two dimensions: briefly, it finds the direction that maximizes the variance of the blue dots, which can equivalently be viewed as minimizing the residuals from the blue dots to a line.
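
If you'd like to see this linear algebra in action before we get to the flags, here is a minimal sketch on simulated data (purely a toy example, not part of the flag analysis) showing that eigen-decomposing the covariance matrix gives the same variances and scores as R's prcomp function:

Code
### toy example: PCA "by hand" via the covariance matrix vs prcomp

set.seed(1)
toy = matrix(rnorm(100 * 5), nrow = 100, ncol = 5) # 100 observations, 5 features

# center each column, then eigen-decompose the covariance matrix
toy_centered = scale(toy, center = TRUE, scale = FALSE)
eig = eigen(cov(toy_centered))

# eigenvalues = variance explained by each PC; eigenvectors = PC directions
eig$values

# PC scores are linear combinations of the original (centered) variables
scores_by_hand = toy_centered %*% eig$vectors

# prcomp gives the same answer (each PC may be flipped in sign)
pca_toy = prcomp(toy, center = TRUE, scale. = FALSE)
all.equal(pca_toy$sdev^2, eig$values)
all.equal(abs(pca_toy$x), abs(scores_by_hand), check.attributes = FALSE)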

Data

In this project, instead of working with gene expression data or another large feature matrix, we will look at images of flags from different countries and regions. The goal is to apply PCA to these flags, assess how well it preserves variation, and hopefully discover some interesting patterns in the principal components themselves.

To begin our exploration, we need to install and load some R packages. These packages will equip us with the necessary tools for working with data frames, functions, file paths, reading images, and image processing.

Code
### install and read in packages

library(tidyverse) # for data frames and plotting
library(here) # for file paths relative to the project root
library(png) # for reading png images
library(magick) # for reading and resizing images
library(imager) # for converting images to data frames for plotting

Now that we have the necessary packages, let's grab the flag data from here. This dataset contains flags of varying sizes organized in different folders; we are going to look specifically at the “png250px” folder, which holds 255 flags from various countries and regions.

Code
### Grabbing the data

# url to flag data, it is in zip file
url = "https://github.com/hampusborgos/country-flags/archive/refs/heads/main.zip"

# specify the file name and location where you want to save the file on your computer
file_name = "flags.zip"
file_path = here()

# use the download.file() function
download.file(url, paste(file_path, file_name, sep = "/"), mode = "wb")

# unzip zip file
unzip(paste0(here(), "/flags.zip"), exdir = here())

# get file names
files = list.files(here("country-flags-main/png250px/"), full.names = TRUE)

# Read each image file in the folder
image_list = lapply(files, image_read)
# image_list = lapply(files, readPNG)

For PCA to work, every observation needs the same number of features, so the images must all have the same dimensions. That's not the case here: all flags share the same height (250px) but have very different widths. Hence, we need to resize each flag to a common height and width. I use the magick R package to read in the images, and its “image_scale” and “image_convert” functions to transform each image to 250x250x3 (height, width, and color channels). The resized images are saved in the folder “resized_png250px”. To enable the PCA analysis, each image is then flattened from a 250x250x3 array into a single vector of length 187500 (250 x 250 x 3), and the individual vectors are stacked into a single matrix of dimensions 255x187500. This variable, “image_matrix”, can be saved as flags_matrix.RDS. Note that each row of this matrix is one country/region flag image and each column is one pixel value.

Code
# NOTE: so each image is the same height but very different widths lol. Let's change that

# Set height and width I want
max_height = 250
max_width = 250

# Resize all the images to the specified height and width
# had to add matte=FALSE to get rid of extra channel
resized_images = lapply(image_list, function(im) {
  image_convert(image_scale(im, paste0(max_width, "x", max_height, "!")), format = "png", matte=FALSE)
})


# Create the directory to save the resized images
dir.create("resized_png250px")

# Save the resized images with the same names as the original files
for (i in seq_along(resized_images)) {
  # Extract the file name from the full path
  file_name = basename(files[i])
  # make the file path for saving the resized image
  save_path = file.path("resized_png250px", file_name)
  # Write the resized image to the specified file path
  image_write(resized_images[[i]], save_path)
}


# now read the resized images back in with the png package
resized_files = list.files("resized_png250px", full.names = TRUE)

# use function readPNG
imgs_final = lapply(resized_files, readPNG)

# QC: check each image is same dimensions
# for (i in seq_along(imgs_final)) {
#   dimensions = dim(imgs_final[[i]])
#   cat("Image", i, "Dimensions:", dimensions[1], "x", dimensions[2], "x", dimensions[3], "\n")
# }

# great!

# Get the number of images in the list
num_images = length(image_list)

# Create an empty matrix to store the flattened images
image_matrix = matrix(NA, nrow = num_images, ncol = 250 * 250 * 3)

# Flatten each image and store it as a row in the matrix
for (i in 1:num_images) {
  # get the flatten vector
  flattened_image = as.vector(imgs_final[[i]])
  # add to matrix
  image_matrix[i,] = flattened_image
}

# save as .rds file
# saveRDS(image_matrix, file = here("flags_matrix.RDS"))


# plot images

# want to add all images to this dataframe for ggplot
all_images = data.frame()

# for loop grabbing ten example flags (every 20th image)
for (i in seq(1,190,20)) {
  # getting image - making cimg from imager package and then df
  img = as.data.frame(as.cimg(imgs_final[[i]]), wide = "c") |> 
    # # get rgb channels
    mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
    # # make class labels
    mutate(img_num = paste("Image",i))
  
  all_images = rbind(all_images, img)
  
  # make the levels of img variable what i want
  all_images$img_num = factor(all_images$img_num, levels = sprintf("Image %d", seq(1, 190, 20)))

}

# plot image
print(ggplot(all_images,aes(y,x))+
  # getting rgb
  geom_raster(aes(fill=rgb_value))+
  # fill image
  scale_fill_identity() +
  # facet_wrap
  facet_wrap(.~ img_num, nrow = 2, ncol = 5) +
  # reverse y axis
  scale_y_reverse() + 
  # get rid of theme
  theme_void() +
  # bigger font size
  theme(strip.text = element_text(size = 12)))

Here are ten examples of the flag image data we are using.

PCA on the training data

Technically, our “image_matrix” variable could be fed straight into PCA as is. However, we are going to split the 255 images into a training set and a test set. Why do we do this? Well, for one it was part of the project description 😅 But really, we do this to see how well PCA can project, and reconstruct, the test data using only information learned from the training data. We take the training set to be 75% of the data (191 flags), leaving the remaining 25% (64 flags) as test images. Finally, we are ready to run PCA on the training data! Note that we are setting center = TRUE and scale. = FALSE. Briefly, the center argument shifts the data to be zero-centered (subtracting the mean from each column), while scale. gives each column unit variance (correlation PCA instead of covariance PCA). Centering is crucial for PCA to behave correctly, whereas scaling is dataset dependent; the short aside below illustrates why. Take a look at the prcomp function documentation for more information.
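
A quick aside on that point (a toy sketch on made-up data, not the flags): when one feature lives on a much larger numeric scale than another, covariance PCA lets that feature dominate the first PC, while correlation PCA treats the features equally. Our pixel values all live on the same 0 to 1 scale, so centering alone is a reasonable choice here.

Code
### aside: why scaling is dataset dependent (toy data, not the flags)

set.seed(42)
height_cm = rnorm(100, mean = 170, sd = 10)    # feature on a large numeric scale
foot_m    = rnorm(100, mean = 0.27, sd = 0.02) # feature on a small numeric scale
toy = cbind(height_cm, foot_m)

# covariance PCA (scale. = FALSE): the large-scale feature dominates PC1
prcomp(toy, center = TRUE, scale. = FALSE)$rotation

# correlation PCA (scale. = TRUE): both features contribute comparably
prcomp(toy, center = TRUE, scale. = TRUE)$rotation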

Code
### Do PCA analysis ###

# Set the seed 
set.seed(123)

# Get the number of images from image_matrix (recomputed here so this chunk runs on its own)
num_images = dim(image_matrix)[1]

# Calculate the number of images (rows) to use for training
train_columns = floor(0.75 * num_images)

# Randomly select indices for the training data
train_indices = sample(1:num_images, train_columns)

# Get the test indices
test_indices = setdiff(1:num_images, train_indices)

# Split the image matrix into training and test data
train_data = image_matrix[train_indices, ]
test_data = image_matrix[-train_indices, ]

# Perform PCA on the training data, centering data but NOT scaling
pca_result = prcomp(train_data, center=TRUE, scale. = FALSE)

Let's extract some key information from the prcomp output. We can calculate the proportion of variance explained by each PC from the sdev values, i.e. the standard deviations of the principal components, by squaring them and dividing by the sum of all the squared sdev values. Note that sdev^2 is equal to the eigenvalues of the covariance matrix of the data! We can then compute the cumulative variance explained as the number of PCs increases, leading to the plot below. As expected, adding more PCs increases the cumulative variance preserved/explained. This is an excellent visualization for checking how much each PC contributes to the overall variation. In our case, just 67 PCs explain 95% of the variation in the dataset!

Code
# Extract the proportion of variance explained by each principal component
variance_explained = pca_result$sdev^2 / sum(pca_result$sdev^2)

# Calculate the cumulative percentage of variance explained
cumulative_variance = cumsum(variance_explained) * 100

# number of PCs needed to explain at least 95% of the variance
pcs_for_95 = which(cumsum(variance_explained) >= 0.95)[1]

# Create a tibble for plotting
data_plot = tibble(
  # x axis
  num_components = 1:length(cumulative_variance),
  # cumvar
  cumulative_variance = cumulative_variance
)

# plot using ggplot2
ggplot(data_plot, aes(x = num_components, y = cumulative_variance)) +
  # make line plot
  geom_line() +
  # add points
  geom_point() +
  # adding line for which PC number we are getting
  geom_vline(xintercept = pcs_for_95, color = "red", linetype = "dashed") +
  # adding text for PC im getting
  annotate("text", x = 70, y = 85, label = paste("95% variation explained\nwith", as.character(pcs_for_95), "PCs"), hjust = 0, vjust = 0, color = "black", size = 4) +
  # x axis every 20
  scale_x_continuous(breaks = seq(0, max(data_plot$num_components), by = 20)) +
  # y axis every 20
  scale_y_continuous(breaks = seq(0, 100, by = 20)) +
  # labels
  labs(x = "Number of Principal Components",
       y = "Cumulative Variance Explained (%)",
       title = "Cumulative Variance Explained with Additional Principal Components") +
  # theme
  theme_bw() +
  # change text vars
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        axis.text = element_text(size = 10))

Another popular plot used to visualize PCs and their explained variation is a scree plot. Check out the wiki page for more information.
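
If you want to make one yourself, a scree plot needs nothing more than the variance_explained vector we just computed; here is a minimal sketch for the first 20 PCs:

Code
### quick scree plot of the first 20 PCs

scree_df = tibble(
  pc = 1:20,
  var_explained = variance_explained[1:20] * 100
)

ggplot(scree_df, aes(x = pc, y = var_explained)) +
  geom_col() +
  labs(x = "Principal Component",
       y = "Variance Explained (%)",
       title = "Scree Plot of the First 20 Principal Components") +
  theme_bw()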

So we were able to successfully perform PCA on our image data… what now? Well, let’s see how well we can reconstruct the data using only the first 67 PCs, which contained 95% of the variance. To achieve this, we need to do a bit of matrix multiplication as well as add the mean back to un-center the data.

The reconstructed training data are computed as

\[
\underbrace{\hat{X}_{\text{train}}}_{191 \times 187500} \;=\; \underbrace{Z}_{191 \times 191}\,\underbrace{V^{\top}}_{191 \times 187500} \;+\; \text{mean},
\]

where \(Z\) is the matrix of PC scores (pca_result$x), \(V\) is the matrix of eigenvectors (pca_result$rotation), and the column means (pca_result$center) are added back to un-center the data.

If we want to keep only the top k PCs (in our case, the top 67), we simply subset the PC scores and eigenvectors from 191 columns down to 67. I then reshape each reconstructed row vector back into a 250x250x3 image array, normalize its values to the range 0 to 1, and collect all the reconstructed image arrays in a single list.
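
As a quick sanity check (not part of the original pipeline, just a verification of the formula above), reconstructing with all 191 PCs and adding the mean back should recover the original training matrix up to floating-point error, since no variation has been thrown away:

Code
### sanity check: full reconstruction should recover the training data

recon_full = pca_result$x %*% t(pca_result$rotation)        # scores x t(eigenvectors)
recon_full = sweep(recon_full, 2, pca_result$center, "+")   # add the column means back
all.equal(recon_full, train_data, check.attributes = FALSE) # TRUE (up to numerical error)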

Code
### reconstructing training data ###

# use pca_results to reconstruct training data fully
reconstructed_train_data_allpcs = pca_result$x %*% t(pca_result$rotation)
# add the mean back to un-center the data
reconstructed_train_data_allpcs = scale(reconstructed_train_data_allpcs, center = -pca_result$center, scale = FALSE)

# now use just 95% variation explained pcs
reconstructed_train_data_95 = pca_result$x[,1:pcs_for_95] %*% t(pca_result$rotation[,1:pcs_for_95])
# add the mean back to un-center the data
reconstructed_train_data_95 = scale(reconstructed_train_data_95, center = -pca_result$center, scale = FALSE)

# Create an empty list to store the 250x250x3 arrays
matrices_250x250x3 = vector("list", length(train_indices))

# Reshape each row of the reconstructed data into a 250x250x3 array
for (i in 1:length(train_indices)) {
  matrix_250x250x3 = array(reconstructed_train_data_allpcs[i,], dim = c(250, 250, 3))
  matrices_250x250x3[[i]] = matrix_250x250x3
}

# Function to normalize a matrix to the range [0, 1]
normalize_matrix = function(matrix) {
  min_value = min(matrix)
  max_value = max(matrix)
  normalized_matrix = (matrix - min_value) / (max_value - min_value)
  return(normalized_matrix)
}

# Normalize each matrix in the list
flags_reconstructed_train_data_allpcs = lapply(matrices_250x250x3, normalize_matrix)

# now do for 95% variation

# Create an empty list to store the 250x250x3 arrays
matrices_250x250x3 = vector("list", length(train_indices))

# Reshape each row of the reconstructed data into a 250x250x3 array
for (i in 1:length(train_indices)) {
  matrix_250x250x3 = array(reconstructed_train_data_95[i,], dim = c(250, 250, 3))
  matrices_250x250x3[[i]] = matrix_250x250x3
}

# Normalize each matrix in the list
flags_reconstructed_train_data_95 = lapply(matrices_250x250x3, normalize_matrix)

Now let's see how well PCA reconstructed the data! I used ggplot to plot three versions of five different flags: the original resized training image, the image reconstructed with all 191 PCs, and the image reconstructed with only 67 PCs.

Code
# for loop getting 5 examples
for (i in 1:5) {
  
  # getting image - making cimg from imager package and then df
  img1 = as.data.frame(as.cimg(imgs_final[[train_indices[i]]]), wide = "c")  |> 
    # get rgb channels
    mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
    # make class for facet_wrap
    mutate(image_type = "Original Image")
  
  # getting image - making cimg from imager package and then df
  img2 = as.data.frame(as.cimg(flags_reconstructed_train_data_allpcs[[i]]), wide = "c") |>
    # get rgb channels
    mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
    # make class for facet_wrap
    mutate(image_type = "Fully Reconstructed Image")

  # getting image - making cimg from imager package and then df
  img3 = as.data.frame(as.cimg(flags_reconstructed_train_data_95[[i]]), wide = "c") |>
    # get rgb channels
    mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
    # make class for facet_wrap
    mutate(image_type = "67 PCs Reconstructed Image")
  
  # combining to all images
  all_images = rbind(img1, img2, img3)
  
  # make the levels of img variable what i want
  all_images$image_type = factor(all_images$image_type, levels = c("Original Image", "Fully Reconstructed Image","67 PCs Reconstructed Image"))

  # plot image
  print(ggplot(all_images,aes(y,x))+
    # getting rgb
    geom_raster(aes(fill=rgb_value))+
    # fill image
    scale_fill_identity() +
    # reverse y axis
    scale_y_reverse() + 
    # wrap by image
    facet_wrap(.~image_type) +
    # get rid of theme
    theme_void() +
    # bigger font size
    theme(strip.text = element_text(size = 12)))
}

What do you think? Is 67 PCs enough to reconstruct the data? Since all 191 PCs were used to reconstruct the flags in the middle column, we expect them to look identical to the original images, as 100% of the variation is preserved. Using only 67 PCs, or 95% of the variation, we instead get grainier images with artifacts from other flags: many of the reconstructed flags seem to have X's on them, along with symbols borrowed from other flags in the dataset. If my goal were to compress these images by reducing their dimensions with PCA, I personally would use more PCs. However, these reconstructed flags might be good enough for classification or another machine learning task. So how many PCs to keep really depends on the goal of your project.
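
If you want to play with that trade-off yourself, the cumulative variance we computed earlier makes it easy to check how many PCs different thresholds would require (a small sketch reusing variance_explained from above):

Code
### how many PCs do different variance thresholds require?

thresholds = c(0.90, 0.95, 0.99)
sapply(thresholds, function(th) which(cumsum(variance_explained) >= th)[1])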

PCA on the test data

Now let's see how well the PCA fit on the training data performs on the test data! To reconstruct the test data, we first use the “predict” function in R to obtain the PC scores for the test data using the PCA results from the training data. We can then apply the same reconstruction formula we used for the training set, and finally plot the resulting flags with ggplot to visualize the reconstructed data.
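
As an aside, predict on a prcomp object is not doing anything mysterious: it centers the new data with the training means and projects it onto the training eigenvectors. The short sketch below (just a check, not a change to the pipeline) should give the same scores:

Code
### what predict() does under the hood for a prcomp object

manual_projection = scale(test_data, center = pca_result$center, scale = FALSE) %*%
  pca_result$rotation
all.equal(manual_projection, predict(pca_result, newdata = test_data),
          check.attributes = FALSE) # should be TRUE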

Code
### Project testing data ###

# first predict the testing data
test_data_projected = predict(pca_result, newdata = test_data)

# Reconstruct the test_data from the projected data using all pcs
reconstructed_test_data_allpcs = test_data_projected %*% t(pca_result$rotation)
# add the training mean back to un-center, just like for the training data
reconstructed_test_data_allpcs = scale(reconstructed_test_data_allpcs, center = -pca_result$center, scale = FALSE)

# reconstruct using the first 67 PCs (95% of the variance)
reconstructed_test_data_95 = test_data_projected[,1:pcs_for_95] %*% t(pca_result$rotation[,1:pcs_for_95])
# add the training mean back to un-center
reconstructed_test_data_95 = scale(reconstructed_test_data_95, center = -pca_result$center, scale = FALSE)

# Create an empty list to store the 250x250x3 arrays
matrices_250x250x3 = vector("list", length(test_indices))

# Reshape each row of the reconstructed test data into a 250x250x3 array
for (i in 1:length(test_indices)) {
  matrix_250x250x3 = array(reconstructed_test_data_allpcs[i,], dim = c(250, 250, 3))
  matrices_250x250x3[[i]] = matrix_250x250x3
}

# Normalize each matrix in the list
flags_reconstructed_test_data_allpcs = lapply(matrices_250x250x3, normalize_matrix)


# now with 95%


# Create an empty list to store the 250x250x3 arrays
matrices_250x250x3 = vector("list", length(test_indices))

# Reshape each row of the reconstructed test data into a 250x250x3 array
for (i in 1:length(test_indices)) {
  matrix_250x250x3 = array(reconstructed_test_data_95[i,], dim = c(250, 250, 3))
  matrices_250x250x3[[i]] = matrix_250x250x3
}

# Normalize each matrix in the list
flags_reconstructed_test_data_95 = lapply(matrices_250x250x3, normalize_matrix)

# for loop getting 5 examples
for (i in 1:5) {
  
  # getting image - making cimg from imager package and then df
  img1 = as.data.frame(as.cimg(imgs_final[[test_indices[i]]]), wide = "c")  |> 
    # get rgb channels
    mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
    # make class for facet_wrap
    mutate(image_type = "Original Image")
  
  # getting image - making cimg from imager package and then df
  img2 = as.data.frame(as.cimg(flags_reconstructed_test_data_allpcs[[i]]), wide = "c") |>
    # get rgb channels
    mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
    # make class for facet_wrap
    mutate(image_type = "Fully Reconstructed Image")

  # getting image - making cimg from imager package and then df
  img3 = as.data.frame(as.cimg(flags_reconstructed_test_data_95[[i]]), wide = "c") |>
    # get rgb channels
    mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
    # make class for facet_wrap
    mutate(image_type = "67 PCs Reconstructed Image")
  
  # combining to all images
  all_images = rbind(img1, img2, img3)
  
  # make the levels of img variable what i want
  all_images$image_type = factor(all_images$image_type, levels = c("Original Image", "Fully Reconstructed Image","67 PCs Reconstructed Image"))

  # plot image
  print(ggplot(all_images,aes(y,x))+
    # getting rgb
    geom_raster(aes(fill=rgb_value))+
    # fill image
    scale_fill_identity() +
    # reverse y axis
    scale_y_reverse() + 
    # wrap by image
    facet_wrap(.~image_type) +
    # get rid of theme
    theme_void() +
    # bigger font size
    theme(strip.text = element_text(size = 12)))
}

Now you can observe that even when using all 191 PCs to reconstruct the test data, the result isn't perfect. This makes sense, though! Since PCA was not fit on the test data, the test flags likely contain information the algorithm has never seen. In other words, there is variation in the test set that was not present in the training set, so a perfectly reconstructed image is not possible.

I would argue that the difference between using 191 PCs and 67 PCs on the test data isn't very significant, particularly when compared to the notable differences we saw between the two reconstructions on the training data; the reconstruction quality doesn't look too bad. One interesting observation concerns the last image, which features a trident in the center. The reconstructed data is clearly unable to recreate this symbol: instead we see a star in the middle image and a different, vaguely similar symbol in the far-right image. This suggests there was no comparable symbol in the training set, causing PCA to struggle to reconstruct it.
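
One way to put a rough number on that impression (an extra check I am adding here, not part of the original write-up) is the mean squared error between each reconstruction and the original test pixels. This assumes the reconstructions above had the training means added back, as in the code chunk:

Code
### mean squared reconstruction error on the test flags (extra check)

mse = function(recon) mean((recon - test_data)^2)

mse(reconstructed_test_data_allpcs) # all 191 training PCs
mse(reconstructed_test_data_95)     # first 67 PCs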

Visualizing the principal components

To finish up, let's take a look at the top 10 PCs in image form. To do this, we simply take the transpose of the rotation matrix inside pca_result, so that each row is one eigenvector (one PC) of length 187500. We reshape each of these vectors into image form using the same process as before and visualize them with ggplot.

Code
### Plot PCs ###

# get pcs
principal_components = t(pca_result$rotation)

# Create an empty list to store the 250x250x3 arrays
matrices_250x250x3 = vector("list", dim(principal_components)[1])

# Reshape each principal component into a 250x250x3 array
for (i in 1:dim(principal_components)[1]) {
  matrix_250x250x3 = array(principal_components[i,], dim = c(250, 250, 3))
  matrices_250x250x3[[i]] = matrix_250x250x3
}

# Normalize each matrix in the list
flags_principal_components = lapply(matrices_250x250x3, normalize_matrix)

# want to add all images to this dataframe for ggplot
all_images = data.frame()

# for loop grabbing the top ten PCs
for (i in 1:10) {
  # getting image - making cimg from imager package and then df
  img = as.data.frame(as.cimg(flags_principal_components[[i]]), wide = "c")  |> 
    # get rgb channels
    mutate(rgb_value = rgb(c.1, c.2, c.3)) |> 
    # make class labels
    mutate(img_num = paste("PC",i))
  
  all_images = rbind(all_images, img)
  
  # make the levels of img variable what i want
  all_images$img_num = factor(all_images$img_num, levels = c("PC 1","PC 2","PC 3","PC 4","PC 5",
                                                                   "PC 6","PC 7","PC 8","PC 9","PC 10"))
}

# plot image
print(ggplot(all_images,aes(y,x))+
  # getting rgb
  geom_raster(aes(fill=rgb_value))+
  # fill image
  scale_fill_identity() +
  # facet_wrap
  facet_wrap(.~ img_num, nrow = 2, ncol = 5) +
  # reverse y axis
  scale_y_reverse() + 
  # get rid of theme
  theme_void() +
  # bigger font size
  theme(strip.text = element_text(size = 12)))

Personally, I think this is the coolest part! It’s amazing to visually observe the variations captured by each PC. One of the most compelling examples that caught my attention is the comparison between PC6 and PC9. Both PCs capture the flag symbol in the top left corner and exhibit similar colors. However, PC6 splits its colors horizontally, while PC9 splits them vertically. They complement each other in a nice way, showcasing the diverse ways information is reflected across the PCs! Some other PCs, such as PC3 and PC4, have a striking resemblance to specific flag images they were trained on. On the other hand, PCs like PC7 and PC10 seem to combine elements from two or more flags in the dataset, creating cool hybrid representations.

These results actually made me curious about what the appearance of the PCs beyond the top 10 would look like. So let’s plot ten of them side by side and compare! This will hopefully provide further insights into the patterns and variations captured by these additional PCs.

Code
### Plot PCs ###

# want to add all images to this dataframe for ggplot
all_images = data.frame()

# for loop grabbing ten of the later PCs (every 20th, starting at PC11)
for (i in seq(11,191,20)) {
  # getting image - making cimg from imager package and then df
  img = as.data.frame(as.cimg(flags_principal_components[[i]]), wide = "c")  |> 
    # get rgb channels
    mutate(rgb_value = rgb(c.1, c.2, c.3)) |> 
    # make class labels
    mutate(img_num = paste("PC",i))
  
  all_images = rbind(all_images, img)
  
  # make the levels of img variable what i want
  all_images$img_num = factor(all_images$img_num, levels = sprintf("PC %d", seq(11, 191, 20)))
}

# plot image
print(ggplot(all_images,aes(y,x))+
  # getting rgb
  geom_raster(aes(fill=rgb_value))+
  # fill image
  scale_fill_identity() +
  # facet_wrap
  facet_wrap(.~ img_num, nrow = 2, ncol = 5) +
  # reverse y axis
  scale_y_reverse() + 
  # get rid of theme
  theme_void() +
  # bigger font size
  theme(strip.text = element_text(size = 12)))

Except for PC11, which is only one rank outside the top 10, these PCs look quite different from the top PCs. Rather than capturing the overall colors of the flags, they appear to emphasize finer patterns within the dataset. Interestingly, the final PC is essentially a completely black image. Checking our variance_explained variable, it accounts for 7.620894e-35 of the variance, which is approximately 0%, highlighting its lack of information. Additionally, many of the later PCs feature specific symbols like stars and shields, which may explain why our reconstructed test data failed to capture the trident and instead showed other symbols, such as a star. This is an important reminder that predicting new data is challenging, particularly when it introduces novel variation (in our case, flag symbols) that PCA never saw during training. The same concept holds across most (if not all) machine learning applications, especially deep learning, where the availability of comprehensive training data can be a major constraint.
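
If you want to check that near-zero contribution yourself, one line does it (the exact value may differ slightly depending on how the random train/test split falls out on your machine):

Code
### variance explained by the last few PCs

tail(variance_explained, 3)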

Final thoughts

Honestly, this was a really fun project! It really helped me understand PCA better and how it can be used with images, and I got to practice my R skills with matrix multiplication, imaging packages, and ggplot! I hope this project helps if you are struggling to figure out how to do PCA on images in R. When I searched, it was hard to find a good resource that went through the process step by step… so hopefully this helped, and my code is commented well enough for anyone to understand 😊 Let me know if you have any questions about the process or why I did what I did!

Thanks for reading 🥳🥳

Caleb 🎲