Code
### install and read in packages
library(tidyverse) # for dataframes
library(here) # for use of others if downloaded
library(png) # for reading png images
library(magick) # for reading in images
library(imager) # for plotting
Caleb Hallinan
October 19, 2023
In this project, we will embark on an adventure using Principal Component Analysis on images of various flags of countries/regions. Hopefully by the end you will have a better understanding of PCA and how we can utilize it to look at images in R!
Note: I condensed the R code to “Code” sections throughout for scrolling purposes. Feel free to click on these sections to look at each step, or change to “Show All Code” in the top right of the webpage!
Principal Components Analysis (PCA) is a well-established technique to reduce the dimensions or features of a dataset while preserving the maximum amount of variation. This technique is widely used for pattern recognition, signal processing, and machine learning. PCA is also a useful tool in denoising, visualization, and classification of large datasets given features are linearly related. For example, imagine you have a genomics dataset with 10000+ genes and 5000+ observations. That’s a lot of data! However, not all of those genes will be important as likely many of them are correlated, or express similar variation, as others. PCA is a great tool to reduce the 10000+ dimensions to a smaller number, eliminating redundancy by reducing the dimensions of the dataset for more downstream analysis.
PCA is heavily based on linear algebra concepts. Essentially, PCA computes the eigenvectors of the covariance matrix of the data and sorts them by their eigenvalue (which correspond to the explained variance for that individual principal component). The principal components (PC) are then computed as linear combinations of the original variables using the eigenvectors. It sounds complicated, but I promise you it’s not as hard to understand as you think. This article was a great resource for me when I was first trying to figure it out. Unfortunately, I won’t go into much of the algebra behind PCA, so I encourage you to research it more! You can see this really cool animation above expressing what PCA is doing in two dimensions. Briefly, it is finding the direction that maximizes the variance of the blue dots which can also be viewed as minimizing the residuals of the blue dots to a line.
In this project specifically, instead of working with gene expressions or large datasets with numerous features, we are looking at images of flags from different countries. The goal is to apply PCA to these flags to assess its effectiveness in preserving variation and possibly discover interesting patterns or insights in the principal components themselves.
To begin our exploration, we need to install and load some R packages. These packages will equip us with the necessary tools for working with data frames, functions, file paths, reading images, and image processing.
Now that we have the necessary packages, let’s grab the flag data from here. This dataset contains flags of varying sizes within different folders, however we are going to look specifically at the “/png250px/” folder. This folder has 255 flags from various countries and regions.
### Grabbing the data
# url to flag data, it is in zip file
url = "https://github.com/hampusborgos/country-flags/archive/refs/heads/main.zip"
# specify the file name and location where you want to save the file on your computer
file_name = "flags.zip"
file_path = here()
# use the download.file() function
download.file(url, paste(file_path, file_name, sep = "/"), mode = "wb")
# unzip zip file
unzip(paste0(here(), "/flags.zip"), exdir = here())
# get file names
files = list.files(here("country-flags-main/png250px/"), full.names = TRUE)
# Read each image file in the folder
image_list = lapply(files, image_read)
# image_list = lapply(files, readPNG)
For PCA to work soundly, it is essential that the dataset’s dimensions remain consistent. Of course, that’s not the case in this dataset where we see all flags have the same height (250px) but very different widths. Hence, we need to resize each flag to be the same height and width. I use the magick R package to read in the images and the “image_scale” along with the “image_convert” functions to transform the images into size 250x250x3 (representing height, width, and color channel). These resized images are then saved in the folder “/resized_png250px/.” To enable PCA analysis, the images are converted from matrix to vector format. This involves flattening the 250x250x3 image matrix into a single vector of size 1x187500 (250 x 250 x 3). The individual vectors for each image are then combined into a single matrix of dimensions 255x187500. This variable, “image_matrix” is created and saved as flags_matrix.RDS. Also to note, this matrix is country/region flag image x pixel of flag image.
# NOTE: so each image is the same height but very different widths lol. Let's change that
# Set height and width I want
max_height = 250
max_width = 250
# Resize all the images to the specified height and width
# had to add matte=FALSE to get rid of extra channel
resized_images = lapply(image_list, function(im) {
image_convert(image_scale(im, paste0(max_width, "x", max_height, "!")), format = "png", matte=FALSE)
})
# Create the directory to save the resized images
dir.create("resized_png250px")
# Save the resized images with the same names as the original files
for (i in seq_along(resized_images)) {
# Extract the file name from the full path
file_name = basename(files[i])
# make the file path for saving the resized image
save_path = file.path("resized_png250px", file_name)
# Write the resized image to the specified file path
image_write(resized_images[[i]], save_path)
}
# now grab them with png package
resized_files = list.files(here("blog/2023-10-20-pca-on-images/resized_png250px/"), full.names = TRUE)
# use function readPNG
imgs_final = lapply(resized_files, readPNG)
# QC: check each image is same dimensions
# for (i in seq_along(imgs_final)) {
# dimensions = dim(imgs_final[[i]])
# cat("Image", i, "Dimensions:", dimensions[1], "x", dimensions[2], "x", dimensions[3], "\n")
# }
# great!
# Get the number of images in the list
num_images = length(image_list)
# Create an empty matrix to store the flattened images
image_matrix = matrix(NA, nrow = num_images, ncol = 250 * 250 * 3)
# Flatten each image and store it as a column in the matrix
for (i in 1:num_images) {
# get the flatten vector
flattened_image = as.vector(imgs_final[[i]])
# add to matrix
image_matrix[i,] = flattened_image
}
# save as .rds file
# saveRDS(image_matrix, file = here("flags_matrix.RDS"))
# plot images
# want to add all images to this dataframe for ggplot
all_images = data.frame()
# for loop getting 5 examples
for (i in seq(1,190,20)) {
# getting image - making cimg from imager package and then df
img = as.data.frame(as.cimg(imgs_final[[i]]), wide = "c") |>
# # get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# # make class labels
mutate(img_num = paste("Image",i))
all_images = rbind(all_images, img)
# make the levels of img variable what i want
all_images$img_num = factor(all_images$img_num, levels = sprintf("Image %d", seq(1, 190, 20)))
}
# plot image
print(ggplot(all_images,aes(y,x))+
# getting rgb
geom_raster(aes(fill=rgb_value))+
# fill image
scale_fill_identity() +
# facet_wrap
facet_wrap(.~ img_num, nrow = 2, ncol = 5) +
# reverse y axis
scale_y_reverse() +
# get rid of theme
theme_void() +
# bigger font size
theme(strip.text = element_text(size = 12)))
Here are ten examples of the flag image data we are using.
Technically our “image_matrix” variable could be input straight into PCA the way it is. However, we are going to split the 255 images into a training and test set. Why do we do this? Well, for one it was part of the project description 😅 But really we do this to see how well PCA will predict, or project, the test data using only information from the training data. Our training set we take to be 75% of the data (191 flags) leaving 25% of data being the test images (64 flags). Finally, we are ready to conduct PCA on the training data! Note that we are setting center=TRUE and scale.=False. Briefly, the center hyperparameter shifts the data to be zero centered (subtracting the mean from each column) while scale will make the data have unit variance (correlation instead of covariance PCA). Centering is crucial for PCA to perform correctly, however scaling is dataset dependent. Take a look at the prcomp function description for more information.
### Do PCA analysis ###
# Set the seed
set.seed(123)
# Get the number of images in the list - for some reason needed to add this to this code chunk
num_images = dim(image_matrix)[1]
# Calculate the number of columns for training data
train_columns = floor(0.75 * num_images)
# Randomly select indices for the training data
train_indices = sample(1:num_images, train_columns)
# Get the test indices
test_indices = setdiff(1:num_images, train_indices)
# Split the image matrix into training and test data
train_data = image_matrix[train_indices, ]
test_data = image_matrix[-train_indices, ]
# Perform PCA on the training data, centering data but NOT scaling
pca_result = prcomp(train_data, center=TRUE, scale. = FALSE)
Let’s extract some key information form the prcomp function. We are able to calculate the proportion of variance explained from each PC by looking at the \(sdev\) value, aka the standard deviations of the principal components, by squaring them and dividing by the sum of the squared \(sdev\) values. To note, \(sdev^2\) is actually equal to the eigenvalues of the dataset! We can then get the cumulative variation as the number of PC components increase, leading to the plot you see below. Here we see that as PCs increase so does the cumulative variance preserved/explained. This is an excellent visualization to check how much each PC contributes to preserving overall variation. In our case we see that with just 67 PCs we can explain 95% of the variation in the dataset!
# Extract the proportion of variance explained by each principal component
variance_explained = pca_result$sdev^2 / sum(pca_result$sdev^2)
# Calculate the cumulative percentage of variance explained
cumulative_variance = cumsum(variance_explained) * 100
# get pcs getting more than 95% of the data
pcs_for_95 = which(cumsum(variance_explained) >= 0.95)[1]
# Create a tibble for plotting
data_plot = tibble(
# x axis
num_components = 1:length(cumulative_variance),
# cumvar
cumulative_variance = cumulative_variance
)
# plot using ggplot2
ggplot(data_plot, aes(x = num_components, y = cumulative_variance)) +
# make line plot
geom_line() +
# add points
geom_point() +
# adding line for which PC number we are getting
geom_vline(xintercept = pcs_for_95, color = "red", linetype = "dashed") +
# adding text for PC im getting
annotate("text", x = 70, y = 85, label = paste("95% variation explained\nwith", as.character(pcs_for_95), "PCs"), hjust = 0, vjust = 0, color = "black", size = 4) +
# x axis every 20
scale_x_continuous(breaks = seq(0, max(data_plot$num_components), by = 20)) +
# y axis every 20
scale_y_continuous(breaks = seq(0, 100, by = 20)) +
# labels
labs(x = "Number of Principal Components",
y = "Cumulative Variance Explained (%)",
title = "Cumulative Variance Explained with Additional Principal Components") +
# theme
theme_bw() +
# change text vars
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text = element_text(size = 10))
Another popular plot used to visualize PCs and their explained variation is a scree plot. Check out the wiki page for more information.
So we were able to successfully perform PCA on our image data… what now? Well, let’s see how well we can reconstruct the data using only the first 67 PCs, which contained 95% of the variance. To achieve this, we need to do a bit of matrix multiplication as well as add the mean back to un-center the data.
The PCA reconstructed data = the PC scores (matrix dimensions 191x191) x the transpose of the eigenvectors (matrix dimensions 191x187500) + mean
If we wanted to take only the top k PCs, which in our case I wanted the top 67, then we simply subset the PC scores and eigenvectors from 191 to 67. I then reconstruct the image matrix from the single image vector, normalize the values from 0-1, and save all the newly reconstructed image matrices in a single list.
### reconstructing training data ###
# use pca_results to reconstruct training data fully
reconstructed_train_data_allpcs = pca_result$x %*% t(pca_result$rotation)
# scale data back to center
reconstructed_train_data_allpcs = scale(reconstructed_train_data_allpcs, center = -pca_result$center, scale = FALSE)
# now use just 95% variation explained pcs
reconstructed_train_data_95 = pca_result$x[,1:pcs_for_95] %*% t(pca_result$rotation[,1:pcs_for_95])
# scale data back to center
reconstructed_train_data_95 = scale(reconstructed_train_data_95, center = -pca_result$center, scale = FALSE)
# Create an empty list to store the 250x250x3 matrices
matrices_250x250x3 = vector("list", length(length(train_indices)))
# Generate a 250x250x3 matrix for each row in image_matrix
for (i in 1:length(train_indices)) {
# flattened_image = as.vector(imgs_final[[i]])
matrix_250x250x3 = array(reconstructed_train_data_allpcs[i,], dim = c(250, 250, 3))
matrices_250x250x3[[i]] = matrix_250x250x3
}
# Function to normalize the matrix
normalize_matrix = function(matrix) {
min_value = min(matrix)
max_value = max(matrix)
normalized_matrix = (matrix - min_value) / (max_value - min_value)
return(normalized_matrix)
}
# Normalize each matrix in the list
flags_reconstructed_train_data_allpcs = lapply(matrices_250x250x3, normalize_matrix)
# now do for 95% variation
# Create an empty list to store the 250x250x3 matrices
matrices_250x250x3 = vector("list", length(length(train_indices)))
# Generate a 250x250x3 matrix for each row in image_matrix
for (i in 1:length(train_indices)) {
# flattened_image = as.vector(imgs_final[[i]])
matrix_250x250x3 = array(reconstructed_train_data_95[i,], dim = c(250, 250, 3))
matrices_250x250x3[[i]] = matrix_250x250x3
}
# Normalize each matrix in the list
flags_reconstructed_train_data_95 = lapply(matrices_250x250x3, normalize_matrix)
Now let’s view how well PCA reconstructed the data! I utilized ggplot to plot three versions of five different flags: the original resized training image, the reconstructed resized training image with all PCs, the reconstructed resized training image with 67 PCs.
# for loop getting 5 examples
for (i in 1:5) {
# getting image - making cimg from imager package and then df
img1 = as.data.frame(as.cimg(imgs_final[[train_indices[i]]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class for facet_wrap
mutate(image_type = "Original Image")
# getting image - making cimg from imager package and then df
img2 = as.data.frame(as.cimg(flags_reconstructed_train_data_allpcs[[i]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class for facet_wrap
mutate(image_type = "Fully Reconstructed Image")
# getting image 1 - making cimg from imager package and then df
img3 = as.data.frame(as.cimg(flags_reconstructed_train_data_95[[i]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class for facet_wrap
mutate(image_type = "67 PCs Reconstructed Image")
# combining to all images
all_images = rbind(img1, img2, img3)
# make the levels of img variable what i want
all_images$image_type = factor(all_images$image_type, levels = c("Original Image", "Fully Reconstructed Image","67 PCs Reconstructed Image"))
# plot image
print(ggplot(all_images,aes(y,x))+
# getting rgb
geom_raster(aes(fill=rgb_value))+
# fill image
scale_fill_identity() +
# reverse y axis
scale_y_reverse() +
# wrap by image
facet_wrap(.~image_type) +
# get rid of theme
theme_void() +
# bigger font size
theme(strip.text = element_text(size = 12)))
}
What do you think? Is 67 PCs enough to reconstruct the data? Since all 191 PCs were used to reconstruct the flags in the middle, we expect it to look identical to the original image as it preserves 100% of the variation. However, using only 67 PCs, or 95% of the variation of the flag images, we see more grainy images with artifacts from other flags. A lot of the reconstructed flags seems to have X’s on them along with a symbol from another flag within the dataset. If my goal was to compress these images by reducing the dimensions using PCA, I personally would use more PCs to do so. However, these reconstructed flags might be enough if one were trying to do classification or another machine learning task. So it really depends on what the goal of your project is to determine how many PCs to keep.
Now let’s see how well the PCA on the training data performs on the testing data! To reconstruct the test data, we first use the “predict” function in R to obtain the PC scores for the training data using the PCA results from the training data. We can then use the same formula we used for training set to get the reconstructed data for the testing set. Finally, we can plot these flags using ggplot to visualize the reconstructed data.
### Project testing data ###
# first predict the testing data
test_data_projected = predict(pca_result, newdata = test_data)
# Reconstruct the test_data from the projected data using all pcs
reconstructed_test_data_allpcs = test_data_projected %*% t(pca_result$rotation)
# reconstruct using 95%
reconstructed_test_data_95 = test_data_projected[,1:pcs_for_95] %*% t(pca_result$rotation[,1:pcs_for_95])
# Create an empty list to store the 250x250x3 matrices
matrices_250x250x3 = vector("list", length(length(test_indices)))
# Generate a 250x250x3 matrix for each row in image_matrix
for (i in 1:length(test_indices)) {
# flattened_image = as.vector(imgs_final[[i]])
matrix_250x250x3 = array(reconstructed_test_data_allpcs[i,], dim = c(250, 250, 3))
matrices_250x250x3[[i]] = matrix_250x250x3
}
# Normalize each matrix in the list
flags_reconstructed_test_data_allpcs = lapply(matrices_250x250x3, normalize_matrix)
# now with 95%
# Create an empty list to store the 250x250x3 matrices
matrices_250x250x3 = vector("list", length(length(test_indices)))
# Generate a 250x250x3 matrix for each row in image_matrix
for (i in 1:length(test_indices)) {
# flattened_image = as.vector(imgs_final[[i]])
matrix_250x250x3 = array(reconstructed_test_data_95[i,], dim = c(250, 250, 3))
matrices_250x250x3[[i]] = matrix_250x250x3
}
# Normalize each matrix in the list
flags_reconstructed_test_data_95 = lapply(matrices_250x250x3, normalize_matrix)
# for loop getting 5 examples
for (i in 1:5) {
# getting image - making cimg from imager package and then df
img1 = as.data.frame(as.cimg(imgs_final[[test_indices[i]]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class for facet_wrap
mutate(image_type = "Original Image")
# getting image - making cimg from imager package and then df
img2 = as.data.frame(as.cimg(flags_reconstructed_test_data_allpcs[[i]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class for facet_wrap
mutate(image_type = "Fully Reconstructed Image")
# getting image 1 - making cimg from imager package and then df
img3 = as.data.frame(as.cimg(flags_reconstructed_test_data_95[[i]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class for facet_wrap
mutate(image_type = "67 PCs Reconstructed Image")
# combining to all images
all_images = rbind(img1, img2, img3)
# make the levels of img variable what i want
all_images$image_type = factor(all_images$image_type, levels = c("Original Image", "Fully Reconstructed Image","67 PCs Reconstructed Image"))
# plot image
print(ggplot(all_images,aes(y,x))+
# getting rgb
geom_raster(aes(fill=rgb_value))+
# fill image
scale_fill_identity() +
# reverse y axis
scale_y_reverse() +
# wrap by image
facet_wrap(.~image_type) +
# get rid of theme
theme_void() +
# bigger font size
theme(strip.text = element_text(size = 12)))
}
Now you can observe that even when using all 191 PCs to reconstruct the test data, the result isn’t perfect. However, this makes sense! Since PCA was not performed on the testing data it likely contains new and unseen information for the algorithm. In other words, there exists variation within this new test dataset that was not present in the training dataset. Therefore, achieving a perfectly reconstructed image is not possible.
I would argue that the difference between using 191 PCs or 67 PCs in the test case isn’t very significant. Particularly when comparing it to the notable differences in the training dataset, the reconstruction quality doesn’t appear to be too bad. There is one interesting observation we can make regarding the last image featuring the trident in the center. It’s clear that the reconstructed data was unable to accurately recreate this symbol. Instead, we see a depiction of a star in the middle image and a distinct-ish symbol in the far-right image. This suggests that there likely wasn’t a similar symbol present in the training dataset, causing PCA to struggle in reconstructing it.
To finish up, lets take a look at the top 10 PCs in image form. To do this, we simply look at the transpose of the rotation variable within pca_results. We transform this single vector into image form via the same process as before, and visualize using ggplot.
### Plot PCs ###
# get pcs
principal_components = t(pca_result$rotation)
# Create an empty list to store the 250x250x3 matrices
matrices_250x250x3 = vector("list", dim(principal_components)[1])
# Generate a 250x250x3 matrix for each row in image_matrix
for (i in 1:dim(principal_components)[1]) {
# flattened_image = as.vector(imgs_final[[i]])
matrix_250x250x3 = array(principal_components[i,], dim = c(250, 250, 3))
matrices_250x250x3[[i]] = matrix_250x250x3
}
# Normalize each matrix in the list
flags_principal_components = lapply(matrices_250x250x3, normalize_matrix)
# want to add all images to this dataframe for ggplot
all_images = data.frame()
# for loop getting 5 examples
for (i in 1:10) {
# getting image - making cimg from imager package and then df
img = as.data.frame(as.cimg(flags_principal_components[[i]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class labels
mutate(img_num = paste("PC",i))
all_images = rbind(all_images, img)
# make the levels of img variable what i want
all_images$img_num = factor(all_images$img_num, levels = c("PC 1","PC 2","PC 3","PC 4","PC 5",
"PC 6","PC 7","PC 8","PC 9","PC 10"))
}
# plot image
print(ggplot(all_images,aes(y,x))+
# getting rgb
geom_raster(aes(fill=rgb_value))+
# fill image
scale_fill_identity() +
# facet_wrap
facet_wrap(.~ img_num, nrow = 2, ncol = 5) +
# reverse y axis
scale_y_reverse() +
# get rid of theme
theme_void() +
# bigger font size
theme(strip.text = element_text(size = 12)))
Personally, I think this is the coolest part! It’s amazing to visually observe the variations captured by each PC. One of the most compelling examples that caught my attention is the comparison between PC6 and PC9. Both PCs capture the flag symbol in the top left corner and exhibit similar colors. However, PC6 splits its colors horizontally, while PC9 splits them vertically. They complement each other in a nice way, showcasing the diverse ways information is reflected across the PCs! Some other PCs, such as PC3 and PC4, have a striking resemblance to specific flag images they were trained on. On the other hand, PCs like PC7 and PC10 seem to combine elements from two or more flags in the dataset, creating cool hybrid representations.
These results actually made me curious about what the appearance of the PCs beyond the top 10 would look like. So let’s plot ten of them side by side and compare! This will hopefully provide further insights into the patterns and variations captured by these additional PCs.
### Plot PCs ###
# want to add all images to this dataframe for ggplot
all_images = data.frame()
# for loop getting 5 examples
for (i in seq(11,191,20)) {
# getting image - making cimg from imager package and then df
img = as.data.frame(as.cimg(flags_principal_components[[i]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class labels
mutate(img_num = paste("PC",i))
all_images = rbind(all_images, img)
# make the levels of img variable what i want
all_images$img_num = factor(all_images$img_num, levels = sprintf("PC %d", seq(11, 191, 20)))
}
# plot image
print(ggplot(all_images,aes(y,x))+
# getting rgb
geom_raster(aes(fill=rgb_value))+
# fill image
scale_fill_identity() +
# facet_wrap
facet_wrap(.~ img_num, nrow = 2, ncol = 5) +
# reverse y axis
scale_y_reverse() +
# get rid of theme
theme_void() +
# bigger font size
theme(strip.text = element_text(size = 12)))
Except for PC11, which is only one rank away from the top 10, these PCs exhibit distinct characteristics when compared to the top PCs. Rather than focusing on the overall color of the flag images, they appear to emphasize patterns within the dataset. Interestingly, the final PC essentially corresponds to a completely black image. Checking our variance_explained variable, we find it accounts for 7.620894e-35, which is approximately 0% of the dataset highlighting its lack of information. Additionally, many of the later PCs feature specific symbols like stars and shields, which may explain why our reconstructed test data failed to capture the trident but instead exhibited various other symbols such as a star. This serves as an important reminder that predicting new data can be challenging, particularly when it introduces novel variations or, in our case, flag symbols in images that were previously unseen by PCA. This concept holds true across most (if not all) machine learning applications, especially deep learning where the availability of comprehensive training datasets can be a major constraint.
Honestly, this was a really fun project to do! It really helped me understand PCA better and how it can be utilized with images. I also was able to work on my R skills using matrix multiplication, imaging packages, and ggplot! I hope this project will help if you are struggling to understand how to do PCA on images in R. I did some searching and it was hard to find a good resource that went through step-by-step of the process… so hopefully this helped and my code is commented well enough for anyone to understand 😊 Let me know if you have any questions about the process or why I did what I did!
Thanks for reading 🥳🥳
Caleb 🎲
---
title: "Performing Principal Component Analysis on Flag Images in R"
author: "Caleb Hallinan"
date: 10/19/2023
format:
html:
code-fold: true
code-summary: "Code"
code-tools: true
embed-resources: true
fontsize: 12pt
geometry: margin=1.5in
fontcolor: white
image: preview_image.png
categories: [project]
---
<!-- Global params -->
```{r global options, include = FALSE}
knitr::opts_chunk$set(echo=TRUE, include = TRUE, warning=FALSE, message=FALSE)
```
In this project, we will embark on an adventure using Principal Component Analysis on images of various flags of countries/regions. Hopefully by the end you will have a better understanding of PCA and how we can utilize it to look at images in R!
Note: I condensed the R code to "Code" sections throughout for scrolling purposes. Feel free to click on these sections to look at each step, or change to "Show All Code" in the top right of the webpage!
## Introduction
Principal Components Analysis (PCA) is a well-established technique to reduce the dimensions or features of a dataset while preserving the maximum amount of variation. This technique is widely used for pattern recognition, signal processing, and machine learning. PCA is also a useful tool in denoising, visualization, and classification of large datasets given features are linearly related. For example, imagine you have a genomics dataset with 10000+ genes and 5000+ observations. That's a lot of data! However, not all of those genes will be important as likely many of them are correlated, or express similar variation, as others. PCA is a great tool to reduce the 10000+ dimensions to a smaller number, eliminating redundancy by reducing the dimensions of the dataset for more downstream analysis.
<!-- PCA image -->
::: {style="text-align: center;"}
![[Image Source](https://medium.com/analytics-vidhya/dimensionality-reduction-principal-component-analysis-d1402b58feb1)](pca.gif){width="600"}
:::
PCA is heavily based on linear algebra concepts. Essentially, PCA computes the eigenvectors of the covariance matrix of the data and sorts them by their eigenvalue (which correspond to the explained variance for that individual principal component). The principal components (PC) are then computed as linear combinations of the original variables using the eigenvectors. It sounds complicated, but I promise you it's not as hard to understand as you think. This [article](https://builtin.com/data-science/step-step-explanation-principal-component-analysis) was a great resource for me when I was first trying to figure it out. Unfortunately, I won't go into much of the algebra behind PCA, so I encourage you to research it more! You can see this really cool animation above expressing what PCA is doing in two dimensions. Briefly, it is finding the direction that maximizes the variance of the blue dots which can also be viewed as minimizing the residuals of the blue dots to a line.
## Data
In this project specifically, instead of working with gene expressions or large datasets with numerous features, we are looking at images of flags from different countries. The goal is to apply PCA to these flags to assess its effectiveness in preserving variation and possibly discover interesting patterns or insights in the principal components themselves.
To begin our exploration, we need to install and load some R packages. These packages will equip us with the necessary tools for working with data frames, functions, file paths, reading images, and image processing.
<!-- Install/Read packages -->
```{r, read_packages}
### install and read in packages
library(tidyverse) # for dataframes
library(here) # for use of others if downloaded
library(png) # for reading png images
library(magick) # for reading in images
library(imager) # for plotting
```
Now that we have the necessary packages, let's grab the flag data from [here](https://github.com/hampusborgos/country-flags). This dataset contains flags of varying sizes within different folders, however we are going to look specifically at the "/png250px/" folder. This folder has 255 flags from various countries and regions.
<!-- Get the Data -->
```{r, rawdata, eval=TRUE}
### Grabbing the data
# url to flag data, it is in zip file
url = "https://github.com/hampusborgos/country-flags/archive/refs/heads/main.zip"
# specify the file name and location where you want to save the file on your computer
file_name = "flags.zip"
file_path = here()
# use the download.file() function
download.file(url, paste(file_path, file_name, sep = "/"), mode = "wb")
# unzip zip file
unzip(paste0(here(), "/flags.zip"), exdir = here())
# get file names
files = list.files(here("country-flags-main/png250px/"), full.names = TRUE)
# Read each image file in the folder
image_list = lapply(files, image_read)
# image_list = lapply(files, readPNG)
```
For PCA to work soundly, it is essential that the dataset's dimensions remain consistent. Of course, that's not the case in this dataset where we see all flags have the same height (250px) but very different widths. Hence, we need to resize each flag to be the same height and width. I use the magick R package to read in the images and the "image_scale" along with the "image_convert" functions to transform the images into size 250x250x3 (representing height, width, and color channel). These resized images are then saved in the folder "/resized_png250px/." To enable PCA analysis, the images are converted from matrix to vector format. This involves flattening the 250x250x3 image matrix into a single vector of size 1x187500 (250 x 250 x 3). The individual vectors for each image are then combined into a single matrix of dimensions 255x187500. This variable, "image_matrix" is created and saved as flags_matrix.RDS. Also to note, this matrix is country/region flag image x pixel of flag image.
```{r, fig.height = 3, fig.width = 9, fig.align='center'}
# NOTE: so each image is the same height but very different widths lol. Let's change that
# Set height and width I want
max_height = 250
max_width = 250
# Resize all the images to the specified height and width
# had to add matte=FALSE to get rid of extra channel
resized_images = lapply(image_list, function(im) {
image_convert(image_scale(im, paste0(max_width, "x", max_height, "!")), format = "png", matte=FALSE)
})
# Create the directory to save the resized images
dir.create("resized_png250px")
# Save the resized images with the same names as the original files
for (i in seq_along(resized_images)) {
# Extract the file name from the full path
file_name = basename(files[i])
# make the file path for saving the resized image
save_path = file.path("resized_png250px", file_name)
# Write the resized image to the specified file path
image_write(resized_images[[i]], save_path)
}
# now grab them with png package
resized_files = list.files(here("blog/2023-10-20-pca-on-images/resized_png250px/"), full.names = TRUE)
# use function readPNG
imgs_final = lapply(resized_files, readPNG)
# QC: check each image is same dimensions
# for (i in seq_along(imgs_final)) {
# dimensions = dim(imgs_final[[i]])
# cat("Image", i, "Dimensions:", dimensions[1], "x", dimensions[2], "x", dimensions[3], "\n")
# }
# great!
# Get the number of images in the list
num_images = length(image_list)
# Create an empty matrix to store the flattened images
image_matrix = matrix(NA, nrow = num_images, ncol = 250 * 250 * 3)
# Flatten each image and store it as a column in the matrix
for (i in 1:num_images) {
# get the flatten vector
flattened_image = as.vector(imgs_final[[i]])
# add to matrix
image_matrix[i,] = flattened_image
}
# save as .rds file
# saveRDS(image_matrix, file = here("flags_matrix.RDS"))
# plot images
# want to add all images to this dataframe for ggplot
all_images = data.frame()
# for loop getting 5 examples
for (i in seq(1,190,20)) {
# getting image - making cimg from imager package and then df
img = as.data.frame(as.cimg(imgs_final[[i]]), wide = "c") |>
# # get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# # make class labels
mutate(img_num = paste("Image",i))
all_images = rbind(all_images, img)
# make the levels of img variable what i want
all_images$img_num = factor(all_images$img_num, levels = sprintf("Image %d", seq(1, 190, 20)))
}
# plot image
print(ggplot(all_images,aes(y,x))+
# getting rgb
geom_raster(aes(fill=rgb_value))+
# fill image
scale_fill_identity() +
# facet_wrap
facet_wrap(.~ img_num, nrow = 2, ncol = 5) +
# reverse y axis
scale_y_reverse() +
# get rid of theme
theme_void() +
# bigger font size
theme(strip.text = element_text(size = 12)))
```
Here are ten examples of the flag image data we are using.
## PCA on the training data
Technically our "image_matrix" variable could be input straight into PCA the way it is. However, we are going to split the 255 images into a training and test set. Why do we do this? Well, for one it was part of the project description 😅 But really we do this to see how well PCA will predict, or project, the test data using only information from the training data. Our training set we take to be 75% of the data (191 flags) leaving 25% of data being the test images (64 flags). Finally, we are ready to conduct PCA on the training data! Note that we are setting center=TRUE and scale.=False. Briefly, the center hyperparameter shifts the data to be zero centered (subtracting the mean from each column) while scale will make the data have unit variance (correlation instead of covariance PCA). Centering is crucial for PCA to perform correctly, however scaling is dataset dependent. Take a look at the [prcomp function description](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prcomp) for more information.
<!-- PCA Analysis -->
```{r}
### Do PCA analysis ###
# Set the seed
set.seed(123)
# Get the number of images in the list - for some reason needed to add this to this code chunk
num_images = dim(image_matrix)[1]
# Calculate the number of columns for training data
train_columns = floor(0.75 * num_images)
# Randomly select indices for the training data
train_indices = sample(1:num_images, train_columns)
# Get the test indices
test_indices = setdiff(1:num_images, train_indices)
# Split the image matrix into training and test data
train_data = image_matrix[train_indices, ]
test_data = image_matrix[-train_indices, ]
# Perform PCA on the training data, centering data but NOT scaling
pca_result = prcomp(train_data, center=TRUE, scale. = FALSE)
```
Let's extract some key information form the prcomp function. We are able to calculate the proportion of variance explained from each PC by looking at the $sdev$ value, aka the standard deviations of the principal components, by squaring them and dividing by the sum of the squared $sdev$ values. To note, $sdev^2$ is actually equal to the eigenvalues of the dataset! We can then get the cumulative variation as the number of PC components increase, leading to the plot you see below. Here we see that as PCs increase so does the cumulative variance preserved/explained. This is an excellent visualization to check how much each PC contributes to preserving overall variation. In our case we see that with just 67 PCs we can explain 95% of the variation in the dataset!
```{r, fig.align='center'}
# Extract the proportion of variance explained by each principal component
variance_explained = pca_result$sdev^2 / sum(pca_result$sdev^2)
# Calculate the cumulative percentage of variance explained
cumulative_variance = cumsum(variance_explained) * 100
# get pcs getting more than 95% of the data
pcs_for_95 = which(cumsum(variance_explained) >= 0.95)[1]
# Create a tibble for plotting
data_plot = tibble(
# x axis
num_components = 1:length(cumulative_variance),
# cumvar
cumulative_variance = cumulative_variance
)
# plot using ggplot2
ggplot(data_plot, aes(x = num_components, y = cumulative_variance)) +
# make line plot
geom_line() +
# add points
geom_point() +
# adding line for which PC number we are getting
geom_vline(xintercept = pcs_for_95, color = "red", linetype = "dashed") +
# adding text for PC im getting
annotate("text", x = 70, y = 85, label = paste("95% variation explained\nwith", as.character(pcs_for_95), "PCs"), hjust = 0, vjust = 0, color = "black", size = 4) +
# x axis every 20
scale_x_continuous(breaks = seq(0, max(data_plot$num_components), by = 20)) +
# y axis every 20
scale_y_continuous(breaks = seq(0, 100, by = 20)) +
# labels
labs(x = "Number of Principal Components",
y = "Cumulative Variance Explained (%)",
title = "Cumulative Variance Explained with Additional Principal Components") +
# theme
theme_bw() +
# change text vars
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text = element_text(size = 10))
```
Another popular plot used to visualize PCs and their explained variation is a scree plot. Check out the [wiki page](https://en.wikipedia.org/wiki/Scree_plot) for more information.
So we were able to successfully perform PCA on our image data... what now? Well, let's see how well we can reconstruct the data using only the first 67 PCs, which contained 95% of the variance. To achieve this, we need to do a bit of matrix multiplication as well as add the mean back to un-center the data.
The PCA reconstructed data = the PC scores (matrix dimensions 191x191) x the transpose of the eigenvectors (matrix dimensions 191x187500) + mean
If we wanted to take only the top k PCs, which in our case I wanted the top 67, then we simply subset the PC scores and eigenvectors from 191 to 67. I then reconstruct the image matrix from the single image vector, normalize the values from 0-1, and save all the newly reconstructed image matrices in a single list.
```{r}
### reconstructing training data ###
# use pca_results to reconstruct training data fully
reconstructed_train_data_allpcs = pca_result$x %*% t(pca_result$rotation)
# scale data back to center
reconstructed_train_data_allpcs = scale(reconstructed_train_data_allpcs, center = -pca_result$center, scale = FALSE)
# now use just 95% variation explained pcs
reconstructed_train_data_95 = pca_result$x[,1:pcs_for_95] %*% t(pca_result$rotation[,1:pcs_for_95])
# scale data back to center
reconstructed_train_data_95 = scale(reconstructed_train_data_95, center = -pca_result$center, scale = FALSE)
# Create an empty list to store the 250x250x3 matrices
matrices_250x250x3 = vector("list", length(length(train_indices)))
# Generate a 250x250x3 matrix for each row in image_matrix
for (i in 1:length(train_indices)) {
# flattened_image = as.vector(imgs_final[[i]])
matrix_250x250x3 = array(reconstructed_train_data_allpcs[i,], dim = c(250, 250, 3))
matrices_250x250x3[[i]] = matrix_250x250x3
}
# Function to normalize the matrix
normalize_matrix = function(matrix) {
min_value = min(matrix)
max_value = max(matrix)
normalized_matrix = (matrix - min_value) / (max_value - min_value)
return(normalized_matrix)
}
# Normalize each matrix in the list
flags_reconstructed_train_data_allpcs = lapply(matrices_250x250x3, normalize_matrix)
# now do for 95% variation
# Create an empty list to store the 250x250x3 matrices
matrices_250x250x3 = vector("list", length(length(train_indices)))
# Generate a 250x250x3 matrix for each row in image_matrix
for (i in 1:length(train_indices)) {
# flattened_image = as.vector(imgs_final[[i]])
matrix_250x250x3 = array(reconstructed_train_data_95[i,], dim = c(250, 250, 3))
matrices_250x250x3[[i]] = matrix_250x250x3
}
# Normalize each matrix in the list
flags_reconstructed_train_data_95 = lapply(matrices_250x250x3, normalize_matrix)
```
Now let's view how well PCA reconstructed the data! I utilized ggplot to plot three versions of five different flags: the original resized training image, the reconstructed resized training image with all PCs, the reconstructed resized training image with 67 PCs.
```{r, fig.height = 2, fig.width = 9, fig.align='center'}
# for loop getting 5 examples
for (i in 1:5) {
# getting image - making cimg from imager package and then df
img1 = as.data.frame(as.cimg(imgs_final[[train_indices[i]]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class for facet_wrap
mutate(image_type = "Original Image")
# getting image - making cimg from imager package and then df
img2 = as.data.frame(as.cimg(flags_reconstructed_train_data_allpcs[[i]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class for facet_wrap
mutate(image_type = "Fully Reconstructed Image")
# getting image 1 - making cimg from imager package and then df
img3 = as.data.frame(as.cimg(flags_reconstructed_train_data_95[[i]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class for facet_wrap
mutate(image_type = "67 PCs Reconstructed Image")
# combining to all images
all_images = rbind(img1, img2, img3)
# make the levels of img variable what i want
all_images$image_type = factor(all_images$image_type, levels = c("Original Image", "Fully Reconstructed Image","67 PCs Reconstructed Image"))
# plot image
print(ggplot(all_images,aes(y,x))+
# getting rgb
geom_raster(aes(fill=rgb_value))+
# fill image
scale_fill_identity() +
# reverse y axis
scale_y_reverse() +
# wrap by image
facet_wrap(.~image_type) +
# get rid of theme
theme_void() +
# bigger font size
theme(strip.text = element_text(size = 12)))
}
```
What do you think? Is 67 PCs enough to reconstruct the data? Since all 191 PCs were used to reconstruct the flags in the middle, we expect it to look identical to the original image as it preserves 100% of the variation. However, using only 67 PCs, or 95% of the variation of the flag images, we see more grainy images with artifacts from other flags. A lot of the reconstructed flags seems to have X's on them along with a symbol from another flag within the dataset. If my goal was to compress these images by reducing the dimensions using PCA, I personally would use more PCs to do so. However, these reconstructed flags might be enough if one were trying to do classification or another machine learning task. So it really depends on what the goal of your project is to determine how many PCs to keep.
## PCA on the test data
Now let's see how well the PCA on the training data performs on the testing data! To reconstruct the test data, we first use the "predict" function in R to obtain the PC scores for the training data using the PCA results from the training data. We can then use the same formula we used for training set to get the reconstructed data for the testing set. Finally, we can plot these flags using ggplot to visualize the reconstructed data.
```{r, fig.height = 2, fig.width = 9, eval=TRUE, fig.align='center'}
### Project testing data ###
# first predict the testing data
test_data_projected = predict(pca_result, newdata = test_data)
# Reconstruct the test_data from the projected data using all pcs
reconstructed_test_data_allpcs = test_data_projected %*% t(pca_result$rotation)
# reconstruct using 95%
reconstructed_test_data_95 = test_data_projected[,1:pcs_for_95] %*% t(pca_result$rotation[,1:pcs_for_95])
# Create an empty list to store the 250x250x3 matrices
matrices_250x250x3 = vector("list", length(length(test_indices)))
# Generate a 250x250x3 matrix for each row in image_matrix
for (i in 1:length(test_indices)) {
# flattened_image = as.vector(imgs_final[[i]])
matrix_250x250x3 = array(reconstructed_test_data_allpcs[i,], dim = c(250, 250, 3))
matrices_250x250x3[[i]] = matrix_250x250x3
}
# Normalize each matrix in the list
flags_reconstructed_test_data_allpcs = lapply(matrices_250x250x3, normalize_matrix)
# now with 95%
# Create an empty list to store the 250x250x3 matrices
matrices_250x250x3 = vector("list", length(length(test_indices)))
# Generate a 250x250x3 matrix for each row in image_matrix
for (i in 1:length(test_indices)) {
# flattened_image = as.vector(imgs_final[[i]])
matrix_250x250x3 = array(reconstructed_test_data_95[i,], dim = c(250, 250, 3))
matrices_250x250x3[[i]] = matrix_250x250x3
}
# Normalize each matrix in the list
flags_reconstructed_test_data_95 = lapply(matrices_250x250x3, normalize_matrix)
# for loop getting 5 examples
for (i in 1:5) {
# getting image - making cimg from imager package and then df
img1 = as.data.frame(as.cimg(imgs_final[[test_indices[i]]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class for facet_wrap
mutate(image_type = "Original Image")
# getting image - making cimg from imager package and then df
img2 = as.data.frame(as.cimg(flags_reconstructed_test_data_allpcs[[i]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class for facet_wrap
mutate(image_type = "Fully Reconstructed Image")
# getting image 1 - making cimg from imager package and then df
img3 = as.data.frame(as.cimg(flags_reconstructed_test_data_95[[i]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class for facet_wrap
mutate(image_type = "67 PCs Reconstructed Image")
# combining to all images
all_images = rbind(img1, img2, img3)
# make the levels of img variable what i want
all_images$image_type = factor(all_images$image_type, levels = c("Original Image", "Fully Reconstructed Image","67 PCs Reconstructed Image"))
# plot image
print(ggplot(all_images,aes(y,x))+
# getting rgb
geom_raster(aes(fill=rgb_value))+
# fill image
scale_fill_identity() +
# reverse y axis
scale_y_reverse() +
# wrap by image
facet_wrap(.~image_type) +
# get rid of theme
theme_void() +
# bigger font size
theme(strip.text = element_text(size = 12)))
}
```
Now you can observe that even when using all 191 PCs to reconstruct the test data, the result isn't perfect. However, this makes sense! Since PCA was not performed on the testing data it likely contains new and unseen information for the algorithm. In other words, there exists variation within this new test dataset that was not present in the training dataset. Therefore, achieving a perfectly reconstructed image is not possible.
I would argue that the difference between using 191 PCs or 67 PCs in the test case isn't very significant. Particularly when comparing it to the notable differences in the training dataset, the reconstruction quality doesn't appear to be too bad. There is one interesting observation we can make regarding the last image featuring the trident in the center. It's clear that the reconstructed data was unable to accurately recreate this symbol. Instead, we see a depiction of a star in the middle image and a distinct-ish symbol in the far-right image. This suggests that there likely wasn't a similar symbol present in the training dataset, causing PCA to struggle in reconstructing it.
## Visualizing the principal components
To finish up, lets take a look at the top 10 PCs in image form. To do this, we simply look at the transpose of the *rotation* variable within *pca_results*. We transform this single vector into image form via the same process as before, and visualize using ggplot.
<!-- looking at the Principal Components -->
```{r, fig.height = 3, fig.width = 9, eval=TRUE, fig.align='center'}
### Plot PCs ###
# get pcs
principal_components = t(pca_result$rotation)
# Create an empty list to store the 250x250x3 matrices
matrices_250x250x3 = vector("list", dim(principal_components)[1])
# Generate a 250x250x3 matrix for each row in image_matrix
for (i in 1:dim(principal_components)[1]) {
# flattened_image = as.vector(imgs_final[[i]])
matrix_250x250x3 = array(principal_components[i,], dim = c(250, 250, 3))
matrices_250x250x3[[i]] = matrix_250x250x3
}
# Normalize each matrix in the list
flags_principal_components = lapply(matrices_250x250x3, normalize_matrix)
# want to add all images to this dataframe for ggplot
all_images = data.frame()
# for loop getting 5 examples
for (i in 1:10) {
# getting image - making cimg from imager package and then df
img = as.data.frame(as.cimg(flags_principal_components[[i]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class labels
mutate(img_num = paste("PC",i))
all_images = rbind(all_images, img)
# make the levels of img variable what i want
all_images$img_num = factor(all_images$img_num, levels = c("PC 1","PC 2","PC 3","PC 4","PC 5",
"PC 6","PC 7","PC 8","PC 9","PC 10"))
}
# plot image
print(ggplot(all_images,aes(y,x))+
# getting rgb
geom_raster(aes(fill=rgb_value))+
# fill image
scale_fill_identity() +
# facet_wrap
facet_wrap(.~ img_num, nrow = 2, ncol = 5) +
# reverse y axis
scale_y_reverse() +
# get rid of theme
theme_void() +
# bigger font size
theme(strip.text = element_text(size = 12)))
```
Personally, I think this is the coolest part! It's amazing to visually observe the variations captured by each PC. One of the most compelling examples that caught my attention is the comparison between PC6 and PC9. Both PCs capture the flag symbol in the top left corner and exhibit similar colors. However, PC6 splits its colors horizontally, while PC9 splits them vertically. They complement each other in a nice way, showcasing the diverse ways information is reflected across the PCs! Some other PCs, such as PC3 and PC4, have a striking resemblance to specific flag images they were trained on. On the other hand, PCs like PC7 and PC10 seem to combine elements from two or more flags in the dataset, creating cool hybrid representations.
These results actually made me curious about what the appearance of the PCs beyond the top 10 would look like. So let's plot ten of them side by side and compare! This will hopefully provide further insights into the patterns and variations captured by these additional PCs.
```{r, fig.height = 3, fig.width = 9, eval=TRUE, fig.align='center'}
### Plot PCs ###
# want to add all images to this dataframe for ggplot
all_images = data.frame()
# for loop getting 5 examples
for (i in seq(11,191,20)) {
# getting image - making cimg from imager package and then df
img = as.data.frame(as.cimg(flags_principal_components[[i]]), wide = "c") |>
# get rgb channels
mutate(rgb_value = rgb(c.1, c.2, c.3)) |>
# make class labels
mutate(img_num = paste("PC",i))
all_images = rbind(all_images, img)
# make the levels of img variable what i want
all_images$img_num = factor(all_images$img_num, levels = sprintf("PC %d", seq(11, 191, 20)))
}
# plot image
print(ggplot(all_images,aes(y,x))+
# getting rgb
geom_raster(aes(fill=rgb_value))+
# fill image
scale_fill_identity() +
# facet_wrap
facet_wrap(.~ img_num, nrow = 2, ncol = 5) +
# reverse y axis
scale_y_reverse() +
# get rid of theme
theme_void() +
# bigger font size
theme(strip.text = element_text(size = 12)))
```
Except for PC11, which is only one rank away from the top 10, these PCs exhibit distinct characteristics when compared to the top PCs. Rather than focusing on the overall color of the flag images, they appear to emphasize patterns within the dataset. Interestingly, the final PC essentially corresponds to a completely black image. Checking our *variance_explained* variable, we find it accounts for 7.620894e-35, which is approximately 0% of the dataset highlighting its lack of information. Additionally, many of the later PCs feature specific symbols like stars and shields, which may explain why our reconstructed test data failed to capture the trident but instead exhibited various other symbols such as a star. This serves as an important reminder that predicting new data can be challenging, particularly when it introduces novel variations or, in our case, flag symbols in images that were previously unseen by PCA. This concept holds true across most (if not all) machine learning applications, especially deep learning where the availability of comprehensive training datasets can be a major constraint.
## Final thoughts
Honestly, this was a really fun project to do! It really helped me understand PCA better and how it can be utilized with images. I also was able to work on my R skills using matrix multiplication, imaging packages, and ggplot! I hope this project will help if you are struggling to understand how to do PCA on images in R. I did some searching and it was hard to find a good resource that went through step-by-step of the process... so hopefully this helped and my code is commented well enough for anyone to understand 😊 Let me know if you have any questions about the process or why I did what I did!
Thanks for reading 🥳🥳
Caleb 🎲