Sentiment and Topic Model Analysis of Alexandre Dumas & Others

Code

### install and read in packages

# install.packages("wordcloud") # install this for wordcloud
# install.packages("janitor") # install for sentiment analysis
# install.packages("devtools")
# devtools::install_github("ropensci/gutenbergr")
# install.packages("tm")

library(tidyverse) # for data frames, pipes, and plotting
library(devtools) # for installing packages from GitHub
library(gutenbergr) # for data
library(here) # for project-relative file paths
library(tidytext) # for tokenizing and sentiment analysis
library(paletteer) # for colors for plots
library(tm) # text mining package (document-term matrices)
library(topicmodels) # topic model package

Choosing an Author

Initially, the researcher was interested in exploring Charles Darwin’s works because of a background in biology. However, a sentiment analysis of his works made it evident that they read more like scientific papers than books, so the analysis yielded less informative and exciting results than anticipated. The focus therefore shifted to the works of Alexandre Dumas, a celebrated fiction and fantasy novelist renowned for works such as The Three Musketeers and The Count of Monte Cristo, a classic the researcher has personally read and thoroughly enjoyed. From the extensive collection of Dumas’ books available through the gutenbergr package, six were selected: The Three Musketeers, Ten Years Later, Twenty Years After, The Black Tulip, The Count of Monte Cristo, Illustrated, and The Wolf-Leader. According to Alexandre Dumas’ Wikipedia page, his books are classified into various types of fiction. The first three books belong to The D’Artagnan Romances, while The Count of Monte Cristo, Illustrated and The Black Tulip fall into the adventure genre. The Wolf-Leader, the final novel analyzed, is categorized as one of Dumas’ fantasy books. With this background, a sentiment analysis can uncover differences and similarities not only between individual books but also between genres within Dumas’ literary repertoire.

Code

### Getting Dumas, Alexandre Data ###


# if file doesn't exist, download the data
if (!file.exists(here("dumas.RDS"))) {
  
  # message it wasn't found
  message("File not found, downloading now...")
  
  dumas = gutenberg_works() |>
  # group by author
  group_by(author) |>
  # filter to get dumas
  filter(author == "Dumas, Alexandre") |>
  # download data
  gutenberg_download(meta_fields = "title", strip=TRUE)
  
  # save the files to RDS objects
  saveRDS(dumas, file = here("dumas.RDS"))
  
  # message when done
  message("Finished!") 
}


# read in dumas
dumas = readRDS(here("dumas.RDS"))
# use git_ignore to not push
# usethis::use_git_ignore("dumas.RDS")


# get row numbers for dumas
dumas = dumas |> 
  # get rid of id
  select(-gutenberg_id) |>
  # get rid of lines with no text 
  filter(text != "") |> 
  # group by title
  group_by(title) |>
  # make new column
  mutate(linenumber = row_number()) |> 
  ungroup()

Sentiment Analysis

After conducting sentiment analysis on six of Alexandre Dumas’ well-known books, some interesting trends in his overall and book-specific writing style emerged. Dumas’ writing exhibited a predominantly negative tone, as shown by the cumulative sentiment declining steadily across all six books. This is only a small sample of his literary works, so the trend is suggestive rather than conclusive, although it is consistent with the researcher’s existing knowledge of Dumas.
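As a minimal sketch of how a cumulative sentiment curve of this kind is built, the toy example below bins lines into fixed-size chunks, takes the positive minus negative word count per chunk, and accumulates the result. The data frame `toy`, the column names `net` and `cumulative`, and the chunk size of 2 are made up for illustration; the analysis itself uses chunks of 80 lines and the bing lexicon.

```r
library(dplyr)
library(tidyr)

# hypothetical toy data: one row per sentiment-bearing word, with its line number
toy <- tibble(
  linenumber = c(1, 1, 2, 3, 4, 4, 5, 6),
  sentiment  = c("negative", "positive", "negative", "negative",
                 "positive", "negative", "negative", "positive")
)

# assumption for the toy only; the analysis uses 80 lines per chunk
chunk_size <- 2

toy_sentiment <- toy |>
  # integer division assigns each line to a chunk (a rough "page")
  count(page = linenumber %/% chunk_size, sentiment) |>
  # one column per sentiment, filling missing combinations with 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  # net score per chunk, then running total across chunks
  mutate(net = positive - negative, cumulative = cumsum(net))

toy_sentiment
```

A curve that keeps falling, as in the toy `cumulative` column for the first three chunks, means negative words outnumber positive ones chunk after chunk, not that any single chunk is extremely negative.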

Code

### Sentiment Analysis of Dumas, Alexandre Data ###

# sample data here to save computational effort for rest of analysis
dumas = dumas |> 
  filter(title %in% c("The Three Musketeers", "Ten Years Later", "Twenty Years After", 
                      "The Black Tulip", "The Count of Monte Cristo, Illustrated", 
                      "The Wolf-Leader"))


# tokenize author
tidy_dumas = dumas |>
  unnest_tokens(word, text)


# # check to see what the top words are
# tidy_dumas |>
#   # group by word
#   group_by(word) |>
#   # count
#   tally() |> 
#   # arrange them
#   arrange(desc(n)) 
# # NOTE: lots of the's and a's and such


# # make a word cloud for just count of monte cristo
# tt_dumas = tidy_dumas |>
#   # get count of monte
#   filter(title == "The Count of Monte Cristo, Illustrated") |>
#   # get actual count of works
#   count(word) |>
#   # arrange
#   arrange(desc(n)) |>
#   # only get 200
#   slice(1:200L)
# # make a wordcloud of it
# wordcloud::wordcloud(tt_dumas$word, tt_dumas$n)
# # NOTE: lots of the's and a's and such


# # can see words by by books
# tidy_dumas |>
#   # count
#   count(title, word) |> 
#   #arrange
#   arrange(desc(n)) |> 
#   # group by title now
#   group_by(title) |> 
#   # get a couple
#   slice(1L) 


# filter author with stop words
tidy_dumas = tidy_dumas |> 
  # get rid of stop words
  anti_join(stop_words, by = "word")


# # check with filtered author now
# tidy_dumas |>
#   count(word) |>
#   arrange(desc(n))
# # NOTE: lots of de, madame, replied, etc. that I should prob get rid of


# top words by book
top_dumas_words = tidy_dumas |>
  # count with word and group by title
  count(word, title) |>
  # arrange
  arrange(desc(n)) |> 
  # group by title
  group_by(title) 
# top_dumas_words |> slice(1:2)
# NOTE: lots of names I should get rid of


# # word cloud with no stop words
# tt_dumas = tidy_dumas |>
#   # get count of monte again
#   filter(title == "The Count of Monte Cristo, Illustrated") |> 
#   # count
#   count(word) |>
#   # arrange
#   arrange(desc(n)) |> 
#   # get 200
#   slice(1:200L) 
# # make wordcloud
# wordcloud::wordcloud(tt_dumas$word, tt_dumas$n)
# # NOTE: "count" is highest word unsurpisingly


# get bing sentiments
bing = tidytext::sentiments 
# getting dupe words from janitor package
dupes = bing |> 
  janitor::get_dupes(word) 
# get rid of dupes
bing = bing |> 
  anti_join(dupes |> filter(sentiment == "positive"), by = c("word", "sentiment"))
# check
# anyDuplicated(bing$word) == 0
# NOTE: good here!


# top word sentiments with all words
# top_dumas_words |>
#   slice(1:2) |>
#   left_join(bing, by = join_by(word))


# # top word sentiments with only words with sentiment
# top_dumas_words |>
#   # get rid of drop words
#   filter(!word %in% dropwords) |> 
#   inner_join(bing, by = join_by(word)) |>
#   slice(20:30) 
#   # NOTE: checked slices 1:30


# Using this method to look at the text to determine if I should remove the word or not
# or use a regex method
# dumas |>
#   filter(str_detect(text, "ah"))

# NOTES:
# majesty: is usually "his majesty" or "your majesty" so remove all
# honor: thought it would be like "your honor" but not really so keep it
# prisoner: almost always its "the prisoner" so remove
# master: mostly his master and master (referring to person), so add
# excellency: same as your honor vibe, remove
# stranger: not really used as person as much as I would have thought, so keep
# de: found this when doing comparison to other authors, just a particle between names


# going to get rid of some words here
dropwords = c("majesty", "prisoner", "master", "excellency", "de")


# author sentiment
dumassentiment = tidy_dumas |> 
  # get rid of drop words
  filter(!word %in% dropwords) |> 
  # join with bing to get sentiment
  inner_join(bing, by = join_by(word)) |> 
  # bin lines into chunks of 80 lines (a rough "page") and count sentiment words per chunk
  count(title, page = linenumber %/% 80, sentiment) |> 
  # pivot wider data here
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |> 
  # get sentiment score
  mutate(sentiment = positive - negative) 
# check head of data
# head(dumassentiment)
# NOTE: looks good!


# begin graphing


# define the desired order of the legend
desired_order <- c("The Three Musketeers", "Ten Years Later", "Twenty Years After", "The Black Tulip", 
                   "The Count of Monte Cristo, Illustrated", "The Wolf-Leader")
# define the desired colors for each title
desired_colors <- c("darkblue", "blue", "lightblue", "darkred", "red","darkgreen")


# change order for graph
dumassentiment$title = factor(dumassentiment$title, levels = desired_order)




# # ggplot of basic analysis of sentiment overtime
# ggplot(dumassentiment, aes(page, sentiment, fill = title)) + 
#   # make bar graph, remove legend cause dont need
#   geom_bar(stat = "identity", show.legend = FALSE) + 
#   # plot by book/title
#   facet_wrap(~title, ncol = 3, scales = "free_x") +
#   labs(
#     x = "Page Number",
#     y = "Sentiment Score",
#     title = "Sentiment Score Throughout Each Book",
#     caption = "Data Source: Project Gutenberg"
#   ) + 
#   theme_bw() +
#   theme(
#     text = element_text(size = 11.5),
#     plot.title = element_text(face = "bold", hjust = 0.5),
#     legend.text = element_text(face = "italic")
#     ) + 
#     scale_fill_manual(
#     values = setNames(desired_colors, desired_order),
#     breaks = desired_order
#   )


# plot the data
g = dumassentiment|> 
  # group by title
  group_by(title) |> 
  # get cumulative sentiment over time
  mutate(sentiment = cumsum(sentiment), page = page/max(page)) |> 
  # plot sentiment over time
  ggplot(aes(page, sentiment, colour = title)) + 
  # make the line width bigger
  geom_line(linewidth = 1.25) + 
  # labels
  labs(
    x = "Percent of Total Pages (%)",
    y = "Cumulative Sentiment",
    title = "Trajectory of Sentiment Throughout Dumas' Works",
    caption = "Data Source: Project Gutenberg"
  )

# making transparent legend
transparent_legend = theme(legend.background = element_rect(fill = "transparent"), legend.key = 
                             element_rect(fill = "transparent", color = "transparent"))

# plot
g + 
  # change colors
  # scale_color_brewer(type = "qual") + 
  # scale_colour_manual(values = paletteer_d("ggprism::colors", 12), breaks = desired_order) +
  # make specific colors go to specific titles
  scale_colour_manual(
    values = setNames(desired_colors, desired_order),
    breaks = desired_order
  ) +
  # change x axis to percent
  scale_x_continuous(labels = scales::percent_format()) + 
  # change theme to classic
  theme_classic() + 
  # add transparent legend after theme_classic(), since a complete theme resets earlier theme() settings
  transparent_legend + 
  # edit legend position and text
  theme(
    legend.position = c(0.25, 0.3), 
    text = element_text(size = 12),
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.text = element_text(face = "italic")
    ) + 
  # set legend title and make legend lines thicker
  guides(colour = guide_legend(title = "Book", override.aes = list(linewidth = 2)))

Within Dumas’ D’Artagnan Romances, represented by three shades of blue, The Three Musketeers had the lowest cumulative sentiment by the end of the book. Twenty Years After, its direct sequel, displayed a negative cumulative sentiment similar to the first book. Ten Years Later, which belongs to the third and final part of the series, contained relatively more positive sentiment but remained negative overall. Hence, cumulative sentiment shifted from book to book while remaining negative throughout.

As for Dumas’ adventure books, The Count of Monte Cristo, Illustrated and The Black Tulip, represented by shades of red, showed markedly different sentiment over time. The Count of Monte Cristo, Illustrated had the lowest cumulative sentiment among all the books, while The Black Tulip approached a neutral sentiment by the end. Notably, the cumulative sentiment of The Count of Monte Cristo, Illustrated dropped substantially about halfway through the book and continued on a steep downward trend. Although the ending is not particularly positive, the researcher found this late, steep decline surprising given the “positive” elements (no spoilers) conveyed during that part of the book.

Lastly, cumulative sentiment was computed for The Wolf-Leader, one of Dumas’ fantasy books. Despite its themes of werewolves, greed, power, and lust, the book had only a slightly negative cumulative sentiment with minimal variation over time. It was somewhat surprising that the negative sentiment in this novel was not more pronounced.

It is important to note that certain words such as “majesty,” “prisoner,” “master,” and “excellency” were omitted from the analysis. These words were typically used as titles or forms of address (e.g., “the prisoner” or “your excellency”) and did not significantly contribute to the books’ content. Comparing the analyses with and without these words showed no significant deviation in the overall sentiment trend presented.
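The with-and-without comparison described above can be sketched on toy data as follows. The tokens, the hand-made lexicon `toy_lexicon`, and the helper `net_sentiment()` are hypothetical stand-ins chosen for illustration; they are not the bing lexicon or the code used for the figures.

```r
library(dplyr)

# hypothetical toy tokens, one row per word
toy_tokens <- tibble(
  word = c("majesty", "happy", "prisoner", "sad", "master", "cruel", "joy")
)

# hand-made lexicon for the toy only (an assumption, not the bing lexicon)
toy_lexicon <- tibble(
  word      = c("majesty", "happy", "prisoner", "sad", "master", "cruel", "joy"),
  sentiment = c("positive", "positive", "negative", "negative",
                "positive", "negative", "positive")
)

# same drop list as in the analysis
dropwords <- c("majesty", "prisoner", "master", "excellency", "de")

# hypothetical helper: positive minus negative count after dropping words
net_sentiment <- function(tokens, drop = character()) {
  scored <- tokens |>
    filter(!word %in% drop) |>
    inner_join(toy_lexicon, by = "word")
  sum(scored$sentiment == "positive") - sum(scored$sentiment == "negative")
}

with_titles    <- net_sentiment(toy_tokens)
without_titles <- net_sentiment(toy_tokens, drop = dropwords)
c(with_titles = with_titles, without_titles = without_titles)
```

Running the same scoring twice, once per word list, makes it easy to check whether title-like words shift the overall trend or only the level.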

Topic Model Analysis

Next, a topic model analysis using Latent Dirichlet Allocation (LDA) was conducted. The Dumas dataset was combined with works by two additional authors, Aristotle and F. Scott Fitzgerald, each contributing five distinct works. The Wolf-Leader was excluded from the Dumas set for this step, leaving five of his works as well. Aristotle’s selected books were The Poetics of Aristotle, The Categories, Politics: A Treatise on Government, Aristotle on the art of poetry, and The Athenian Constitution. F. Scott Fitzgerald’s chosen works were This Side of Paradise, Flappers and Philosophers, The Beautiful and Damned, The Great Gatsby, and All the Sad Young Men.

Code

### Getting Aristotle and Fitzgerald Data ###


# if file doesn't exist, download the data
if (!file.exists(here("aristotle.RDS"))) {
  
  # message it wasn't found
  message("File not found, downloading now...")
  
  aristotle = gutenberg_works() |>
  # group by author
  group_by(author) |>
  # filter to get aristotle
  filter(author == "Aristotle") |>
  # download data
  gutenberg_download(meta_fields = "title", strip=TRUE)
  
  # save the files to RDS objects
  saveRDS(aristotle, file = here("aristotle.RDS"))
  
  # message when done
  message("Finished!") 
}


# read in aristotle
aristotle = readRDS(here("aristotle.RDS"))
# use git_ignore to not push
# usethis::use_git_ignore("aristotle.RDS")


# get row numbers for aristotle
aristotle = aristotle |> 
  # get rid of id
  select(-gutenberg_id) |>
  # get rid of lines with no text 
  filter(text != "") |> 
  # group by title
  group_by(title) |>
  # make new column
  mutate(linenumber = row_number()) |> 
  ungroup()


################################################


# if file doesn't exist, download the data
if (!file.exists(here("fitzgerald.RDS"))) {
  
  # message it wasn't found
  message("File not found, downloading now...")
  
  fitzgerald = gutenberg_works() |>
  # group by author
  group_by(author) |>
  # filter to get author
  filter(author == "Fitzgerald, F. Scott (Francis Scott)") |>
  # download data
  gutenberg_download(meta_fields = "title", strip=TRUE)
  
  # save the files to RDS objects
  saveRDS(fitzgerald, file = here("fitzgerald.RDS"))
  
  # message when done
  message("Finished!") 
}


# read in fitzgerald
fitzgerald = readRDS(here("fitzgerald.RDS"))
# use git_ignore to not push
# usethis::use_git_ignore("fitzgerald.RDS")


# get row numbers for fitzgerald
fitzgerald = fitzgerald |> 
  # get rid of id
  select(-gutenberg_id) |>
  # get rid of lines with no text 
  filter(text != "") |> 
  # group by title
  group_by(title) |>
  # make new column
  mutate(linenumber = row_number()) |> 
  ungroup()


### Clean aristotle Data, also do sentiment analysis but don't print ###


# tokenize author
tidy_aristotle = aristotle |>
  unnest_tokens(word, text)


# filter author with stop words
tidy_aristotle = tidy_aristotle |> 
  # get rid of stop words
  anti_join(stop_words, by = "word")


# top words by book
top_aristotle_words = tidy_aristotle |>
  # count with word and group by title
  count(word, title) |>
  # arrange
  arrange(desc(n)) |> 
  # group by title
  group_by(title) 
# top_aristotle_words |> slice(1:2)


# get bing sentiments
bing = tidytext::sentiments 
# getting dupe words from janitor package
dupes = bing |> 
  janitor::get_dupes(word) 
# get rid of dupes
bing = bing |> 
  anti_join(dupes |> filter(sentiment == "positive"), by = c("word", "sentiment"))
# check
# anyDuplicated(bing$word) == 0
# NOTE: good here!


# # # top word sentiments with only words with sentiment
# top_aristotle_words |>
#   # get rid of drop words
#   filter(!word %in% dropwords) |>
#   inner_join(bing, by = join_by(word)) |>
#   slice(1:30)
#   # NOTE: checked slices 1:30


# Using this method to look at the text to determine if I should remove the word or not
# or use a regex method
# aristotle |>
#   filter(str_detect(text, "cried"))

# NOTES:
# majesty: is usually "his majesty" or "your majesty" so remove all
# honor: thought it would be like "your honor" but not really so keep it
# prisoner: almost always its "the prisoner" so remove
# master: mostly his master and master (referring to person), so add
# excellency: same as your honor vibe, remove
# stranger: not really used as person as much as I would have thought, so keep


# going to get rid of some words here
# dropwords = c("majesty", "prisoner", "master", "excellency")


# # author sentiment
# aristotlesentiment = tidy_aristotle |> 
#   # get rid of drop words
#   # filter(!word %in% dropwords) |> 
#   # join with bing to get sentiment
#   inner_join(bing, by = join_by(word)) |> 
#   # TODO: what is this 80 for?
#   count(title, page = linenumber %/% 80, sentiment) |> 
#   # pivot wider data here
#   pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |> 
#   # get sentiment score
#   mutate(sentiment = positive - negative) 
# # check head of data
# head(aristotlesentiment)
# # NOTE: looks good!


# # ggplot of basic analysis of sentiment overtime
# ggplot(aristotlesentiment, aes(page, sentiment, fill = title)) + 
#   # make bar graph, remove legend cause dont need
#   geom_bar(stat = "identity", show.legend = FALSE) + 
#   # plot by book/title
#   facet_wrap(~title, ncol = 3, scales = "free_x") 
# 
# 
# # plot the data
# g = aristotlesentiment|> 
#   # group by title
#   group_by(title) |> 
#   # get cumulative sentiment over time
#   mutate(sentiment = cumsum(sentiment), page = page/max(page)) |> 
#   # plot sentiment over time
#   ggplot(aes(page, sentiment, colour = title)) + 
#   # make the line width bigger
#   geom_line(linewidth = 1.25) + 
#   # labels
#   labs(
#     x = "Percent of Total Pages (%)",
#     y = "Cumulative Sentiment",
#     title = "Trajectory of Sentiment Throughout Aristotle's books",
#     caption = "Data Source: Project Gutenberg"
#   )
# 
# # making transparent legend
# transparent_legend = theme(legend.background = element_rect(fill = "transparent"), legend.key = 
#                              element_rect(fill = "transparent", color = "transparent"))
# 
# # plot
# g + 
#   # add transparent legend
#   transparent_legend + 
#   # change colors
#   scale_color_brewer(type = "qual") +
#   # scale_colour_manual(values = paletteer_d("ggprism::colors", 12), breaks = desired_order) +
#   # make specific colors go to specific titles
#   # scale_colour_manual(
#   #   values = setNames(desired_colors, desired_order),
#   #   breaks = desired_order
#   # ) +
#   # change x axis to percent
#   scale_x_continuous(labels = scales::percent_format()) + 
#   # change theme to classic
#   theme_classic() + 
#   # edit text
#   theme(
#     legend.position = c(0.2, 0.75), 
#     text = element_text(size = 12),
#     plot.title = element_text(face = "bold", hjust = 0.5),
#     legend.text = element_text(face = "italic")
#     ) + 
#   # change legend postion
#   guides(colour = guide_legend(title = "Book", override.aes = list(linewidth = 2)))




### Clean fitzgerald Data, also do sentiment analysis but don't print ###



# tokenize author
tidy_fitzgerald = fitzgerald |>
  unnest_tokens(word, text)


# filter author with stop words
tidy_fitzgerald = tidy_fitzgerald |> 
  # get rid of stop words
  anti_join(stop_words, by = "word")


# top words by book
top_fitzgerald_words = tidy_fitzgerald |>
  # count with word and group by title
  count(word, title) |>
  # arrange
  arrange(desc(n)) |> 
  # group by title
  group_by(title) 
# top_fitzgerald_words |> slice(1:2)


# # # top word sentiments with only words with sentiment
# top_fitzgerald_words |>
#   # get rid of drop words
#   filter(!word %in% dropwords) |>
#   inner_join(bing, by = join_by(word)) |>
#   slice(1:30)
#   # NOTE: checked slices 1:30


# Using this method to look at the text to determine if I should remove the word or not
# or use a regex method
# fitzgerald |>
#   filter(str_detect(text, "gentlemen"))


# going to get rid of some words here
# dropwords = c("majesty", "prisoner", "master", "excellency")


# # author sentiment
# fitzgeraldsentiment = tidy_fitzgerald |> 
#   # get rid of drop words
#   # filter(!word %in% dropwords) |> 
#   # join with bing to get sentiment
#   inner_join(bing, by = join_by(word)) |> 
#   # TODO: what is this 80 for?
#   count(title, page = linenumber %/% 80, sentiment) |> 
#   # pivot wider data here
#   pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |> 
#   # get sentiment score
#   mutate(sentiment = positive - negative) 
# # check head of data
# head(fitzgeraldsentiment)
# # NOTE: looks good!
# 
# 
# # ggplot of basic analysis of sentiment overtime
# ggplot(fitzgeraldsentiment, aes(page, sentiment, fill = title)) + 
#   # make bar graph, remove legend cause dont need
#   geom_bar(stat = "identity", show.legend = FALSE) + 
#   # plot by book/title
#   facet_wrap(~title, ncol = 3, scales = "free_x") 
# 
# 
# # plot the data
# g = fitzgeraldsentiment|> 
#   # group by title
#   group_by(title) |> 
#   # get cumulative sentiment over time
#   mutate(sentiment = cumsum(sentiment), page = page/max(page)) |> 
#   # plot sentiment over time
#   ggplot(aes(page, sentiment, colour = title)) + 
#   # make the line width bigger
#   geom_line(linewidth = 1.25) + 
#   # labels
#   labs(
#     x = "Percent of Total Pages (%)",
#     y = "Cumulative Sentiment",
#     title = "Trajectory of Sentiment Throughout Fitzgerald's books",
#     caption = "Data Source: Project Gutenberg"
#   )
# 
# # making transparent legend
# transparent_legend = theme(legend.background = element_rect(fill = "transparent"), legend.key = 
#                              element_rect(fill = "transparent", color = "transparent"))
# 
# # plot
# g + 
#   # add transparent legend
#   transparent_legend + 
#   # change colors
#   scale_color_brewer(type = "qual") +
#   # scale_colour_manual(values = paletteer_d("ggprism::colors", 12), breaks = desired_order) +
#   # make specific colors go to specific titles
#   # scale_colour_manual(
#   #   values = setNames(desired_colors, desired_order),
#   #   breaks = desired_order
#   # ) +
#   # change x axis to percent
#   scale_x_continuous(labels = scales::percent_format()) + 
#   # change theme to classic
#   theme_classic() + 
#   # edit text
#   theme(
#     legend.position = c(0.2, 0.25), 
#     text = element_text(size = 12),
#     plot.title = element_text(face = "bold", hjust = 0.5),
#     legend.text = element_text(face = "italic")
#     ) + 
#   # change legend postion
#   guides(colour = guide_legend(title = "Book", override.aes = list(linewidth = 2)))

Ultimately, the purpose of this analysis was to:

1. Differentiate between the three authors using relevant keywords, excluding names and proper nouns.
2. Identify interesting patterns among authors and/or books.

For these reasons, certain words identified in earlier analyses were excluded from the results presented here. The excluded words included character names such as “anthony,” “gatsby,” or “dantès,” as well as common forms of address like “sir,” “madame,” and “dear.” Additionally, words such as “lord,” “queen,” and “monk” were removed because, within the context of the data, they typically referred to specific individuals.

See the “Code” section below for the full list of words excluded from the analysis.

Code

# Coming back and getting rid of words (mainly names)
words_i_dont_want = c("anthony", "gloria", "amory", "aramis", "porthos","athos","d’artagnan","count",
                      "de", "monte","cristo","villefort","danglars","madame", "morrel", "cornelius", "rosa",
                      "monsieur","dantès", "valentine", "franz", "sir", "friend", "albert","girl", "king",
                      "lord","queen","raoul","mazarin","father", "caderousse", "sire", "morcerf","majesty",
                      "milady", "friends", "cardinal", "loius", "monk", "colbert", "fouquet", "dear","daisy",
                      "tom", "gatsby", "grimaud", "planchet", "la", "tulip", "louis", "prince", "woman",
                      "duke", "mordaunt", "paris", "gentlemen", "boxtel", "baerle","rosalind", "gryphus","maury",
                      "charles", "le", "francs", "buckingham","comte", "guiche","edmond", "andrea","noirtier",
                      "malicorne", "poet", "ii", "baisemeaux", "montalais", "bonacieux", "chapter","prisoner")

LDA was first performed with three topics, with results and thoughts below.

Code

### LDA Analysis of all Authors ###


# Get bag of words
# author 1 bow
tidy_freq_dumas = tidy_dumas  |> 
  dplyr::ungroup()  |> 
  # count words
  count(title, word, name = "count") |> 
  # drop purely numeric tokens (as.numeric() is NA for everything else)
  filter(is.na(as.numeric(word))) |> 
  # get rid of this novel
  filter(title != "The Wolf-Leader") |> 
  # retroactively get rid of these words
  filter(!word %in% words_i_dont_want)


# author 2 bow
tidy_freq_aristotle = tidy_aristotle  |> 
  dplyr::ungroup()  |> 
  # count words
  count(title, word, name = "count") |> 
  # drop purely numeric tokens (as.numeric() is NA for everything else)
  filter(is.na(as.numeric(word))) |> 
  # retroactively get rid of these words
  filter(!word %in% words_i_dont_want)


# author 3 bow
tidy_freq_fitzgerald = tidy_fitzgerald  |> 
  dplyr::ungroup()  |> 
  # count words
  count(title, word, name = "count") |> 
  # drop purely numeric tokens (as.numeric() is NA for everything else)
  filter(is.na(as.numeric(word))) |> 
  # retroactively get rid of these words
  filter(!word %in% words_i_dont_want)


# combine data
df_authors123 = rbind(tidy_freq_dumas, tidy_freq_aristotle, tidy_freq_fitzgerald) |> 
  # get rid of stop words
  anti_join(stop_words, by = "word") |> 
  # arrange in descending order
  arrange(desc(count))
# head(df_authors123)


# make Document Term Matrix
dtm_author <- df_authors123  |> 
  cast_dtm(title, word, count)


# Perform LDA on 3 topics
lda_author <- LDA(dtm_author, k = 3L, control = list(seed = 10))
# lda_author


# Look at words per topic
beta_author <- tidy(lda_author, matrix = "beta")
# beta_author


# look at top terms
top_terms <- beta_author  |> 
  # group by topic
  group_by(topic)  |> 
  # show top 15
  slice_max(beta, n = 15)  |>  
  ungroup()  |> 
  # arrange by topic, then by highest beta
  arrange(topic, -beta)
# top_terms


# plot top terms
top_terms |> 
  # reorder terms based on beta and topic
  mutate(term = reorder_within(term, beta, topic)) |>
  # change topic from numbers to legible labels
  mutate(topic = case_when(topic == 1 ~ "Topic 1",
                           topic == 2 ~ "Topic 2",
                           topic == 3 ~ "Topic 3")) |> 
  # begin plotting
  ggplot(aes(beta, term, fill = factor(topic))) +
  # columns, or bars for expressing data
  geom_col(show.legend = FALSE) +
  # wrap based on the topic
  facet_wrap(~ topic, scales = "free_y", nrow=2, ncol=2) +
  scale_y_reordered() +
  # labels for plot
  labs(
    x = "Word Probability Per Topic (\u03B2)",
    y = "Word",
    title = "Probability of Top Words Per Topic From a 3-Topic LDA",
    caption = "Data Source: Project Gutenberg"
  ) + 
  # change theme
  theme_linedraw() +
  # edit text and legend
  theme(
    text = element_text(size = 12),
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.text = element_text(face = "italic")
    ) +  
  # change colors of plot
    scale_fill_manual(
    values = setNames(c("darkred", "darkgreen","darkblue"), c("Topic 1", "Topic 2", "Topic 3")))

Shown above are the top 15 words within each topic determined by a 3-topic LDA. Note that none of the names, forms of address, or other words omitted from the analysis appear. Topic 1 appears to be influenced by Aristotle’s books, as it contains words like “government,” “power,” and “voice.” Topic 2, and to a lesser degree Topic 3, contain many past-tense verbs such as “replied” or “cried.” This suggests that some authors may use the past tense more frequently than others.

There are some overlaps of words between topics, such as “eyes” or “time.” In future work, particularly when classifying books, it might be of interest to examine why these overlaps are present and to omit them from the analysis. For the purpose of this study, however, we interpret these overlaps as potential commonalities between authors’ works and as possible differences in word usage across contexts. For instance, in Dumas’ The Black Tulip, quotes such as “‘Of a tumult?’ replied Cornelius, fixing his eyes on his perplexed” and “John, with tears in his eyes, wiped off a drop of the noble blood” showcase Dumas’ use of “eyes” to describe what a character’s eyes are doing. Conversely, in Aristotle’s Politics: A Treatise on Government, examples like “can see better with two eyes, and hear better with two ears” and “see that absolute monarchs now furnish themselves with many eyes” demonstrate Aristotle’s usage of “eyes” as a plain noun rather than as part of a described action. Many more examples of such words can be found in the texts.
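Quotes like these can be pulled with a whole-word search over lines of text, in the same spirit as the `str_detect()` checks in the code sections. The sketch below is a base R version on a made-up data frame: three of the lines are the quotes above, and the "tulip" line is invented filler that should not match.

```r
# toy lines: three quotes from the text above plus one invented non-matching line
toy_lines <- data.frame(
  title = c("The Black Tulip", "The Black Tulip",
            "Politics: A Treatise on Government",
            "Politics: A Treatise on Government"),
  text = c("\"Of a tumult?\" replied Cornelius, fixing his eyes on his perplexed",
           "The tulip was locked away for the night",
           "can see better with two eyes, and hear better with two ears",
           "see that absolute monarchs now furnish themselves with many eyes"),
  stringsAsFactors = FALSE
)

# keep only lines that contain "eyes" as a whole word
hits <- toy_lines[grepl("\\beyes\\b", toy_lines$text, perl = TRUE), ]

# number of matching lines per book
table(hits$title)
```

Reading the matching lines side by side, rather than only counting them, is what reveals the difference in how each author uses the word.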

It is worth noting that a word like “replied” likely differs from the “eyes” example, as it is predominantly used when one person responds to another. The term is common in books, so it was interesting to see how it varies across topics and books, as shown in the figures above and below.
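To make the two LDA outputs used in this analysis concrete, here is a minimal sketch on a tiny invented corpus. The titles, words, and counts in `toy_counts` are hypothetical; only the calls (`cast_dtm()`, `LDA()`, and `tidy()` with `matrix = "beta"` or `"gamma"`) mirror the ones used in the analysis.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# hypothetical bag-of-words counts for four tiny "books"
toy_counts <- tibble(
  title = rep(c("book_a", "book_b", "book_c", "book_d"), each = 3),
  word  = c("replied", "cried", "eyes",
            "replied", "cried", "door",
            "government", "power", "eyes",
            "government", "power", "city"),
  count = c(5L, 4L, 2L, 6L, 3L, 2L, 5L, 4L, 1L, 6L, 5L, 2L)
)

# document-term matrix, then a 2-topic LDA with a fixed seed
toy_dtm <- toy_counts |> cast_dtm(title, word, count)
toy_lda <- LDA(toy_dtm, k = 2L, control = list(seed = 10))

# beta: probability of each word within a topic
toy_beta <- tidy(toy_lda, matrix = "beta")
# gamma: share of each topic within a document
toy_gamma <- tidy(toy_lda, matrix = "gamma")

# beta sums to 1 within each topic, gamma sums to 1 within each document
beta_sums  <- toy_beta  |> group_by(topic)    |> summarise(total = sum(beta))
gamma_sums <- toy_gamma |> group_by(document) |> summarise(total = sum(gamma))
```

Because β is a distribution over words within a topic, a word such as “eyes” can carry weight in more than one topic at once, which is one way overlaps like those discussed above arise.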

Code

### LDA Analysis of all Authors Gamma Plot ###

# check doc in each topic
gamma_author <- tidy(lda_author, matrix = "gamma")
# gamma_author


# get titles for each author
a1_titles = unique(tidy_dumas$title)
a2_titles = unique(tidy_aristotle$title)
a3_titles = unique(tidy_fitzgerald$title)
# order for plot
plot_order = c(a1_titles, a2_titles, a3_titles)


# plot!
gamma_author  |> 
  # make title as factor of document, get plot order correct
  mutate(title = factor(document, levels = plot_order))  |> 
  # make new column author to use for facet wrap for more legible plot
  mutate(author =  case_when(title %in% a1_titles ~ "Alexandre Dumas",
                             title %in% a2_titles ~ "Aristotle",
                             title %in% a3_titles ~ "F. Scott Fitzgerald")) |> 
  # begin plot
  ggplot(aes(x = title, y = gamma, fill = factor(topic))) +
  # facet wrap by author with a shared y axis
  facet_wrap(~author, scales = "free_x") +
  # barplots
  geom_col(width = 0.8) +
  # labels
  labs(
    x = "Book",
    y = "Topic Probability Per Book (\u03B3)",
    title = "Proportion of Topics Per Book From a 3-Topic LDA",
    caption = "Data Source: Project Gutenberg",
    fill = "Topic"
  ) +
  # change theme
  theme_linedraw() +
  # edit text and legend
  theme(axis.text.x = element_text(angle = 55, hjust = 1),
        plot.title = element_text(face = "bold", hjust = 0.5),
        # legend.position = c(0.25, 0.3), 
        text = element_text(size = 12)) + 
  # wrap long titles on the x axis, then change colors of fill
  scale_x_discrete(labels = function(y) str_wrap(y, width = 20), expand = c(0, 0)) +
  scale_fill_manual(values = c("darkred", "darkgreen","darkblue")) +
  # scale_fill_manual(values = paletteer_d("ggprism::colors", 12)) +
  # add some padding
  coord_cartesian(xlim = c(0.5,  5 + 0.5), expand = FALSE)

The figure above shows how each topic was represented in the various books and highlights some differences between the authors. After removing names and forms of address, the topics do not correspond one-to-one to the authors. However, the figure does reveal some intriguing results. Fitzgerald’s books consist exclusively of Topic 1, Dumas’ books heavily feature Topic 2, and Aristotle’s works are a mixture of Topics 1 and 2. This suggests that Fitzgerald and Aristotle may have more similar writing styles, or at least use more words in common, compared to Dumas. It also suggests that Dumas and Fitzgerald share little in writing style and word choice.

Interestingly, Topic 3 appears almost exclusively in Dumas’ The Count of Monte Cristo, Illustrated. The words unique to this topic, such as “return” and “heard,” do not appear among the top 15 words of the other topics. This also suggests that words such as “door” and “house” are used more frequently in this book than in the others. Additionally, a small share of Topic 3 appears in three other books written by Dumas, indicating some consistency within his writing.

Upon further examination of the data, it was discovered that Aristotle never uses the word “cried” in any of his five works analyzed. However, Topic 2, in which “cried” has the third-highest coefficient, makes up the majority of two of Aristotle’s books. This mismatch suggests that three topics are too few and that including more topics will likely enhance our understanding of the underlying connections between these authors and their works. For that reason, another LDA was conducted using five topics. Note that Topics 1-3 will not be identical to those in the previous analysis.
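
A check like the one described above can be sketched as follows. This is a minimal illustration only: `toy_books` and its sentences are hypothetical stand-ins invented for the example, not the Project Gutenberg texts, and the column name `n_cried` is an assumption. In the real analysis the same counting step would presumably run on the tokenized books.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy corpus (NOT the real books); one row per passage
toy_books <- tibble::tibble(
  author = c("Alexandre Dumas", "Aristotle", "Aristotle"),
  text   = c("Ah, he cried, and he cried again at the door.",
             "The government of a state rests on law.",
             "Tragedy is an imitation of an action.")
)

# Tokenize into lowercase words, then count one term per author
term_counts <- toy_books |>
  unnest_tokens(word, text) |>
  group_by(author) |>
  summarise(n_cried = sum(word == "cried"), .groups = "drop")

term_counts
```

An author whose `n_cried` is zero never uses the term, so any topic dominated by that term is a questionable fit for that author's books.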

Code

### LDA with more than 3 topics ###


# Perform LDA on 5 topics
lda_author <- LDA(dtm_author, k = 5L, control = list(seed = 10))
# lda_author


# Look at words per topic
beta_author <- tidy(lda_author, matrix = "beta")
# beta_author


# look at top terms
top_terms <- beta_author  |> 
  # group by topic
  group_by(topic)  |> 
  # keep the top 15 terms per topic
  slice_max(beta, n = 15)  |>  
  ungroup()  |> 
  # arrange by topic, then by highest beta first
  arrange(topic, -beta)
# top_terms



# plot top terms
top_terms |> 
  # reorder terms based on beta and topic
  mutate(term = reorder_within(term, beta, topic)) |>
  # change topic from numbers to legible labels
  mutate(topic = case_when(topic == 1 ~ "Topic 1",
                           topic == 2 ~ "Topic 2",
                           topic == 3 ~ "Topic 3",
                           topic == 4 ~ "Topic 4",
                           topic == 5 ~ "Topic 5")) |> 
  # begin plotting
  ggplot(aes(beta, term, fill = factor(topic))) +
  # columns, or bars for expressing data
  geom_col(show.legend = FALSE) +
  # facet wrap by topic
  facet_wrap(~ topic, scales = "free_y", nrow=2, ncol=3) +
  scale_y_reordered() +
  # labels for plot
  labs(
    x = "Word Probability Per Topic (\u03B2)",
    y = "Word",
    title = "Probability of Top Words Per Topic From a 5-Topic LDA",
    caption = "Data Source: Project Gutenberg"
  ) + 
  # change theme
  theme_linedraw() +
  # edit text and legend
  theme(
    text = element_text(size = 12),
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.text = element_text(face = "italic"),
    axis.text.x = element_text(size = 8)
    ) +  
  # change colors of plot
  scale_fill_manual(
    values = setNames(
      c("darkred", "darkgreen", "darkblue", "darkorange", "#4B0076"),
      c("Topic 1", "Topic 2", "Topic 3", "Topic 4", "Topic 5")
    )
  )

A total of five topics were chosen in an attempt to understand the differences between authors and their respective books. The previous analysis, using three topics, had an issue where certain topic words did not align with the books they were associated with. Although topic counts ranging from four to eight were explored, five topics produced the most interesting results and were judged worth investigating further.
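
The choice of five topics here was made by inspection. As a complementary check, candidate topic counts can also be compared numerically. The sketch below uses perplexity, which is a swapped-in technique and not the method used in this report. It also assumes a stand-in dataset: a small subset of the `AssociatedPress` document-term matrix that ships with topicmodels takes the place of `dtm_author`, and the names `dtm_demo`, `candidate_k`, and `k_comparison` are hypothetical.

```r
library(topicmodels)

# Stand-in data (assumption): a small slice of the AssociatedPress DTM
# bundled with topicmodels; the report itself would use dtm_author
data("AssociatedPress", package = "topicmodels")
dtm_demo <- AssociatedPress[1:20, ]

# Candidate numbers of topics, matching the range explored in the text
candidate_k <- 4:8

# Fit one model per k with a fixed seed and record its perplexity
fit_perplexity <- sapply(candidate_k, function(k) {
  fit <- LDA(dtm_demo, k = k, control = list(seed = 10))
  perplexity(fit, newdata = dtm_demo)
})

k_comparison <- data.frame(k = candidate_k, perplexity = fit_perplexity)
k_comparison
```

Lower perplexity indicates a better fit, but perplexity computed on the training documents tends to keep improving as k grows, so a held-out set of documents would give a fairer comparison. Interpretability of the resulting topics, the criterion used above, remains a reasonable tiebreaker.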

The above figure shows the top 15 words for each of the newly generated topics produced by the five-topic LDA. Topic 1 displays a strong correlation with Aristotle’s works on politics and philosophy, evident through words such as “government,” “public,” and “law.” Topics 2 and 3 contain many similar words, like “time” and “cried,” which were predominantly related to Dumas’ books in the previous analysis. Notably, Topic 2 introduces the word “whilst,” while Topic 3 introduces the word “honor,” both of which serve as new distinguishing terms for differentiating between authors. In Topic 4, the word “eyes” has the highest score, along with words like “night” and “day” that seemed to be connected to Fitzgerald in the last analysis. This topic also introduces the words “suddenly” and “love.” Lastly, Topic 5 is nearly identical to Topic 3 in the previous analysis, which primarily consisted of Dumas’ The Count of Monte Cristo, Illustrated.

To reiterate, although many words overlap across topics, they were retained in the hope of better understanding the interaction of words within each topic. One interesting word that may pique the reader’s curiosity is “ah,” which has been retained in the analysis. This word was found exclusively in the works of Alexandre Dumas, as shown in the figures above. It appears in passages such as ‘“Ah! ah!” within twelve hours, you say?’ and ‘“Ah, ah!” said William to his dog, “it’s easy to see that she is a”’, both of which are from The Black Tulip. The decision was made not to omit this word, as it was used more frequently during this time period and aids in distinguishing between authors.

Code

### LDA Analysis of all Authors Gamma Plot ###

# check doc in each topic
gamma_author <- tidy(lda_author, matrix = "gamma")
# gamma_author


# plot!
gamma_author  |> 
  # make title as factor of document, get plot order correct
  mutate(title = factor(document, levels = plot_order))  |> 
  # make new column author to use for facet wrap for more legible plot
  mutate(author =  case_when(title %in% a1_titles ~ "Alexandre Dumas",
                             title %in% a2_titles ~ "Aristotle",
                             title %in% a3_titles ~ "F. Scott Fitzgerald")) |> 
  # begin plot
  ggplot(aes(x = title, y = gamma, fill = factor(topic))) +
  # facet wrap by author with a shared y axis
  facet_wrap(~author, scales = "free_x") +
  # barplots
  geom_col(width = 0.8) +
  # labels
  labs(
    x = "Book",
    y = paste("Topic Probability Per Book (\u03B3)"),
    title = "Proportion of Topics Per Book From a 5-Topic LDA",
    caption = "Data Source: Project Gutenberg",
    fill = "Topic"
  ) +
  # change theme
  theme_linedraw() +
  # edit text and legend
  theme(axis.text.x = element_text(angle = 55, hjust = 1),
        plot.title = element_text(face = "bold", hjust = 0.5),
        # legend.position = c(0.25, 0.3), 
        text = element_text(size = 12)) + 
  # wrap long book titles on the x axis
  scale_x_discrete(labels = function(y) str_wrap(y, width = 20), expand = c(0, 0)) +
  # change colors of fill
  scale_fill_manual(values = c("darkred", "darkgreen", "darkblue", "darkorange", "#4B0076")) +
  # scale_fill_manual(values = paletteer_d("ggprism::colors", 12)) +
  # add some padding
  coord_cartesian(xlim = c(0.5,  5 + 0.5), expand = FALSE)

The final figure above illustrates the representation of each topic across the various books while highlighting certain differences among the authors. When the number of topics was increased from three to five, the authors separated completely, with Dumas exhibiting subcategories within his books.

Unsurprisingly, Aristotle was exclusively represented by Topic 1, which was not found in any other author’s books. This phenomenon likely occurred because the top 15 words of Topic 1 include “government,” “democracy,” and “oligarchy,” which closely resemble the themes explored in Aristotle’s works on politics. It was fascinating that these words primarily relate to Aristotle’s Politics: A Treatise on Government and The Athenian Constitution rather than his works on poetry, The Poetics of Aristotle and Aristotle on the art of poetry. Further exploration may involve identifying distinct keywords that better differentiate these bodies of work from one another.

Fitzgerald was represented solely by Topic 4 in this analysis, which was characterized by words such as “night,” “day,” “suddenly,” and “love.” Considering the selected works of this author, it comes as no surprise that these words achieve high scores for Fitzgerald. However, had words such as “woman,” “girl,” or “gentlemen” not been excluded from the analysis, his books would likely have separated more clearly into different topics.

Most interestingly, despite being compared to Aristotle’s and Fitzgerald’s works, Dumas’ five works are divided into three topics. The Count of Monte Cristo, Illustrated forms its own distinctive topic, Topic 5, corresponding closely to Topic 3 in the previous analysis. Topic 3 was mainly found in The Three Musketeers and Ten Years Later, while Topic 2 was predominantly in The Black Tulip and Twenty Years After. It was expected that the D’Artagnan Romances trilogy would fall under a single topic, and while this was mostly the case, Twenty Years After contains a significant portion of Topic 2. Topic 2 was fully associated with The Black Tulip, an adventure novel by Dumas characterized by distinguishing words such as “whilst” and “van.” Overall, it was satisfying to observe that this analysis achieves its objective of distinguishing between authors and their works, while also revealing the aforementioned subcategories within Dumas’ books.
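
The book-to-topic assignments discussed above are read off the figure. They could also be extracted directly by keeping, for each book, the topic with the largest gamma. The sketch below assumes a hypothetical `toy_gamma` table with made-up values and placeholder book names, shaped like the output of `tidy(lda_author, matrix = "gamma")`. It does not reproduce the real results.

```r
library(dplyr)

# Hypothetical gamma values (NOT the real model output), in the same
# long format as tidy(lda_author, matrix = "gamma")
toy_gamma <- tibble::tribble(
  ~document, ~topic, ~gamma,
  "Book A",  1L,     0.98,
  "Book A",  2L,     0.02,
  "Book B",  1L,     0.35,
  "Book B",  2L,     0.65,
  "Book C",  1L,     0.10,
  "Book C",  2L,     0.90
)

# For each book, keep only the topic with the highest gamma
dominant_topic <- toy_gamma |>
  group_by(document) |>
  slice_max(gamma, n = 1, with_ties = FALSE) |>
  ungroup()

dominant_topic
```

Applied to `gamma_author`, a table like this would list one dominant topic per book, which makes mixed cases easier to spot when the dominant gamma is well below 1.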

Conclusion

In conclusion, this analysis reveals notable differences among the authors’ works. Specific words unique to each author were identified, indicating distinct writing styles among Dumas, Aristotle, and Fitzgerald. Moreover, the five-topic LDA model distinguished all three authors and uncovered the previously mentioned subcategories within Dumas’ works. Future research could focus on identifying words that distinguish the individual books of Aristotle from one another, and likewise those of Fitzgerald.